Around a year ago on this blog I wrote:

“There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing.”

That’s philosopher’s talk for “I see a rich source of problems that cry out for the ministrations of philosophers of science and of statistics.” Yesterday, I began my talk at the Society for Philosophy and Psychology workshop on “Replication in the Sciences” with examples of two main philosophical tasks: to clarify concepts, and to reveal inconsistencies, tensions and ironies surrounding methodological “discomforts” in scientific practice.

Example of a conceptual clarification:

Editors of the journal Basic and Applied Social Psychology announced they are banning statistical hypothesis testing because it is “invalid” (A puzzle about the latest “test ban”). It is “invalid”, they claim, because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (Trafimow and Marks 2015).

- Since the methodology of testing explicitly rejects the mode of inference it is faulted for not supplying, it is incorrect to call the methods invalid on that ground.
- This is a simple conceptual job of the sort philosophers are good at.
(I don’t know if the group of eminent statisticians assigned to react to the “test ban” will bring up this point. I don’t think it includes any philosophers.)

____________________________________________________________________________________

Example of revealing inconsistencies and tensions:

Critic: It’s too easy to satisfy standard significance thresholds.

You: Why do replicationists find it so hard to achieve significance thresholds?

Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs.

You: So, the replication researchers want methods that pick up on and block these biasing selection effects.

Critic: Actually the “reforms” recommend methods where selection effects and data dredging make no difference.

________________________________________________________________

Whether this tension can be resolved is a separate question.

- We are constantly hearing of how the “reward structure” leads to taking advantage of researcher flexibility
- As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling

The philosopher is the curmudgeon (takes chutzpah!)

I also think it’s crucial for philosophers of science and statistics to show how to improve on and solve problems of methodology in scientific practice.

My slides are below; share comments.

There is nothing I disagree with here (at least not at first sight…), but do you think there is any hope, regardless of the methodology used, that selection effects can be effectively controlled? Most researchers have too much freedom and too little incentive for controls, and there are many things (as Gelman argued in his “Garden of Forking Paths”) researchers do even unconsciously that involve some kind of selection bias. Certainly it is important to teach about these things, but still, as things stand, I’d probably never grant any scientific result the status of a “discovery” unless it is in fact replicated in a new study focused on confirming just this specific result.

A direction for “error statistical” replication research could be to investigate what can go wrong even then, or what the standards for this should be, etc.

This is really the point of the last section on austere self-criticism. Fields that do not make progress in methodology even after a considerable amount of such self-criticism are questionable sciences. I would agree about the need to “replicate” (as a test of the genuineness/generalizability of the effect), but a purely statistical replication of the sort now being pursued in psychology cannot reveal the most important problems in moving from H to H* (even allowing that H is a replicable effect).

I would like to propose a different approach (again) from a ‘customer’ of statistics!

As I understand it, the probability of replication depends on a number of facts about a study. These facts have to be used in combination to establish that each cause of non-replication is improbable or absent (depending on the definitions used), so that the probability of replication is high. These probabilities can be estimated by reasoning by probabilistic elimination, i.e. by showing that each cause of non-replication is improbable. The numerical result of a study (e.g. the mean and distribution of individual observations, or an observed proportion such as 75/96) is only one group of facts about the study, albeit an important one; the others are evidence of how well the study was described, whether there were signs of data dredging, omission of disappointing results, etc.

Many of the probabilities and likelihoods required to make these estimates are highly subjective, so estimating that the probability of replication is high will inevitably be contentious. However, there may be greater general agreement when estimating that the probability of replication is low! This is easier because there need only be agreement that the study has at least one serious flaw, whereas agreeing that the probability of replication is high requires agreeing that there are no flaws, or only minor ones, out of a large number of possible flaws, for which a consensus is harder to reach.

So, if there is a failure to show that the probability of ANY of these causes of non-replication is low, the final probability of replication cannot be high. The order in which the facts are used in the reasoning or calculation process is immaterial. The most straightforward fact about a study is its numerical result, and it is reasonable to examine this first. If the probability of non-replication due to the numerical result alone is not low, there is no point in continuing with the less straightforward facts, as these cannot raise the probability of replication further. (Invoking other studies or anecdotes that might raise the probability of replication, via meta-analysis or in a subjective Bayesian way, would in effect create a different study result with a different probability of replication.)
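The elimination logic above can be sketched in code. This is a hypothetical toy model (my construction, not the commenter’s calculus): it assumes the causes of non-replication are independent, so the probability that none of them operates is the product of their complements, and any single probable cause caps that product regardless of the order in which the facts are examined.

```python
def screen_study(nonrep_probs, required=0.95):
    """Toy model: nonrep_probs[i] is the (assumed independent) probability
    that cause i blocks replication.  The replication probability can be at
    most the product of the complements, so we can stop examining further
    facts as soon as that ceiling falls below the required level."""
    ceiling = 1.0
    for p in nonrep_probs:
        ceiling *= 1.0 - p
        if ceiling < required:
            return False, ceiling  # no point continuing with the other facts
    return True, ceiling

# One serious flaw sinks the study however clean the rest is:
screen_study([0.5, 0.01, 0.02])   # → (False, 0.5)
```

Because multiplication commutes, the order of the factors is indeed immaterial to the final value; the early exit merely saves effort, echoing the point about examining the numerical result first.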

In general, the probability of an observed result or something more extreme given a null hypothesis is approximately the same (or exactly the same, under some assumptions) as the probability of the null hypothesis or something more extreme given the observed result. This would also apply to repeating the study with the same number of observations to try to ‘replicate’ it. However, a result barely greater than the null hypothesis in the direction of the observed result would not be regarded by many as ‘replicating’ the observed result. If the interval of replication were narrowed so that it began at least one SEM above a null hypothesis that was 2 SEMs away from the observed result, then the probability of a repeat study result falling within the ‘replication interval’ would be 0.84. If the probability of replication beyond one SEM above the null hypothesis is to be 0.977 (corresponding to 2 SEMs), then the one-sided P value for the study would have to be 0.00135, based on a null hypothesis that is 3 standard errors away from the observed result.
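The arithmetic in this comment can be checked under the commenter’s implicit model: work in SEM units, take the repeat study’s estimate to be normally distributed about the observed result, and ask how often it lands inside the replication interval. A minimal check under those assumptions, using only the standard normal CDF:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Observed result 2 SEMs above the null; replication interval = results at
# least 1 SEM above the null.  If the repeat estimate is N(observed, SEM^2),
# the chance it clears the interval's edge (1 SEM below the observed result):
p_rep = Phi(2.0 - 1.0)        # ≈ 0.841

# To push this probability up to ≈ 0.977 (2 SEMs of headroom above the
# interval's edge), the observed result must sit 3 SEMs above the null,
# which corresponds to a one-sided P value of:
p_one_sided = 1.0 - Phi(3.0)  # ≈ 0.00135
```

These reproduce the 0.84 and 0.00135 figures quoted in the comment (with 0.977 rather than 0.976 for two SEMs).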

It is necessary therefore to identify first the range of results that would be regarded as replicating a study. This would be analogous to a confidence interval or Bayesian credibility interval. If the probability of replicating the study result within this interval given the numerical result or data alone is already going to be below what is desired after examining the data, then the study could be rejected immediately as being too unreliable to be used for further inferences. However, if the probability of replication given the data alone was high (e.g. 0.975) then the remainder of the paper could be examined. If the probability of non-replication due to other causes given all the relevant facts about the work were zero, then the probability of replication given these facts and all the numerical results would remain 0.975. However, if all the other probabilities were not zero, then the estimated probability of replication could be calculated. All the above probability estimates can be calculated by using the ‘probabilistic elimination theorem’ originally arrived at to model reasoning when estimating the probability of replicating symptoms, etc. (the aspect that deals with possible ‘error’) and differential diagnostic reasoning (or hypothesis testing) in medicine. This is explained in detail in the Oxford Handbook of Clinical Diagnosis, Chapter 13.

If a prior study or prior ‘subjective’ (non-transparent) impressions are to be taken into account in a Bayesian fashion, then the study result to be considered becomes a combination of the current result with the other study results, forming a larger combined sample with a new mean, a new distribution and a new number of observations. However, this pooling of samples may only be valid in the eyes of many if all the studies had passed all the tests of replication (e.g. acceptable study designs, accurate records, etc.). It would be difficult for any subjective, non-transparent impression to satisfy these criteria. So I agree with Christian Hennig that all interesting studies should be repeated to see if they are replicated. A new probability of replicating the combined result arrived at by meta-analysis could then be calculated (and, if necessary, further studies done).

I don’t think probability of replication plays a role; I agree with Senn that this is a wrong-headed consideration to begin with.

Could you please give me the reference for Stephen Senn’s reasoning (and yours) about this so that I can understand it.

https://errorstatistics.com/2015/05/09/stephen-senn-double-jeopardy-judge-jeffreys-upholds-the-law-guest-post/

https://errorstatistics.com/2015/03/16/stephen-senn-the-pathetic-p-value-guest-post/

Excerpts from Senn’s letter

https://errorstatistics.com/2012/05/10/excerpts-from-s-senns-letter-on-replication-p-values-and-evidence/

Thank you for that. I shall assume that Stephen Senn’s objection to the concept of the ‘probability of replication’ is crystallised in the quote from his 2002 letter to the editor of Statistics in Medicine in response to S. Goodman (mentioned in your third reference):

“It would be absurd if our inferences about the world, having just completed a clinical trial, were necessarily dependent on assuming the following. 1. We are now going to repeat this experiment. 2. We are going to repeat it only once. 3. It must be exactly the same size as the experiment we have just run. 4. The inferential meaning of the experiment we have just run is the extent to which it predicts this second experiment.”

I agree with what Stephen Senn says here. However, if he had preceded the first three sentences with ‘IF’, then sentence 4 could be replaced with: “The probability of replicating the result could be used as a measure of the reliability of the study, calculable using familiar statistical assumptions.” Unlike a ‘P’ value, the ‘probability of replication’ is a familiar human concept, one we use (along with ‘corroboration’) when assessing the reliability of what other people report.

If a study were actually to be repeated with the same number of subjects (or more or fewer) then the new data could be combined with those in the previous study and a new probability of replication calculated for the same replication interval. The results of the second study cannot be used to assess whether the first was ‘correct’ or vice versa. If they contradicted each other then the probability of subsequent replication would be lower. If they did concur, then the probability of subsequent replication would be even higher. Perhaps one way of testing the validity of probabilities of replication might be to calibrate them against the frequency of actual replication within some result interval when many studies were repeated in exactly the same way.
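The calibration idea in the last sentence is easy to simulate. A hypothetical sketch (my construction, not the commenter’s): fix a true effect in SEM units, draw many repeat-study estimates, and compare the frequency landing in the replication interval with the normal-theory prediction.

```python
import random
from math import erf, sqrt

def calibration_check(effect=2.0, threshold=1.0, n_repeats=100_000, seed=1):
    """Monte Carlo frequency of repeat estimates, drawn as N(effect, 1) in
    SEM units, falling above the replication threshold."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(effect, 1.0) > threshold for _ in range(n_repeats))
    return hits / n_repeats

# Under these assumptions the frequency should agree with the normal-theory
# value Phi(effect - threshold), i.e. about 0.841 for the defaults.
```

Of course, a real calibration exercise would face exactly the difficulties discussed below: actual repeat studies are never drawn from the idealized model.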

A high estimated probability of future replication would simply provide a degree of confidence that the result was reliable enough to be used to make other interesting inferences or to test various hypotheses about genetics or physiology or the distribution of the result in larger populations or different populations or to plan further studies.

I cannot bring myself to appreciate a probability of replication measure. It seems that the situation is that the phenomenon tested for exists or it doesn’t, and if a study is not replicated in an attempt, it could be because the phenomenon does not exist, but it could also be because the second study was not performed the same way, or because of a failure due entirely to chance. Likewise, a test with low severity that seems to support the existence of some phenomenon might be readily replicable and be doubly misleading after a second experiment appears to confirm the first. It seems we tend to underestimate the challenge of performing a test that someone else designed and performed, exactly as they did in every relevant way. To get a probability of replication measure, do you not have to assume away the most important aspects of trying to replicate?

I think the issue of a probability of replication (the meanings of which Senn delineates well) isn’t directly connected to the issue of an exact replication. Obviously the replication is not exact, and in many cases you wouldn’t want it to be. Recall how Potti declared that Baggerly and Coombes were wrong to claim they couldn’t replicate him because they did not repeat his tricky method of failing to separate training and test data. Potti said, in effect: when they do what I did, they successfully replicate me (see the Potti post, second of three). In other words, to start with, there’s a judgment of what counts as an illicit way to generate a result. Likewise, if a statistically significant effect resulted from cherry-picking or trying and trying again, one would not expect the replicator to repeat it in that illicit manner.

I can see that it is not helpful to repeat flawed analytical procedures, but I am referring to the more mundane aspects such as observing, scoring, and measuring. For some studies, these aspects pose notable challenges. If observations are not made in a consistent manner, it makes little difference how you analyze the data. The probability of replication assumes all is good in this area, right?

Unrelated, but what do you think of this (the response by Förster and co-authors to a “new” analysis of their papers)?

No John, that is not right (see the final sentence of the first paragraph of my comment above of the 6th of June, and the subsequent paragraphs). The probability (p) of replication within an interval (RI) is conditional (/) upon many individual items of evidence (i.e. p(RI/E1∩E2∩…∩En)). The results section of a paper (which is also used to calculate ‘P’ values) is only one item of evidence (e.g. E1) out of the total conditional evidence (E1∩E2∩…∩En). Unless the initial probability of replication conditional on the results section alone is high (or the ‘P’ value low), the prospect of getting a high probability of replication has failed at the first hurdle (like failing to throw a six to get started in a board game). The other evidence would be the methods used (E2, E3, etc.). If the methods were prone to bias, this would lower the probability of replication further. Further evidence to be taken into account would be differences between the author’s and reader’s settings, e.g. the racial mix of patients (Ei, E‘i+1’, etc.). These considerations apply to every situation to which a ‘P’ value or Bayesian calculation is applied, but they are often left unsaid. I make them explicit when estimating the probability of replication by putting the statistical calculations in the context of the other factors.

I agree with you, Professor Mayo. You confirm much of what I say. The probability (p) of replication within some result interval (RI) is not only conditional (/) on the numbers in the results section (e.g. E1) but also on the reliability of the methods and the accuracy with which they are applied (e.g. E2, E3, etc.). There may also be unavoidable differences between the subjects and techniques in the settings of the author and the reader. For example, there may be ‘tricky methods’ that others cannot repeat, or the racial mix in the author’s and reader’s settings may differ (e.g. Ei, E‘i+1’, etc.). It is the reader’s responsibility to estimate how this other conditional evidence will affect the probability of replication in a different setting (as well as in the original setting). It may not be possible or sensible to repeat the study exactly (so that p(RI/E1∩E2∩…∩En) is low). However, if the original study is interesting, the reader might try to do something analogous, which would be a new study. An editor will try to judge the relevance of the original study to the majority of the journal’s readers (based on the transferability of its subjects and methods).

But these objections apply to any probability, including ‘P’ values and Bayesian probability estimates. They too depend on the study being conducted in accordance with generally accepted scientific conventions, i.e. in a way that other people can share; otherwise the ‘knowledge’ described is of no value to anyone else and lies outside the scientific enterprise. It is of course possible that the result of a repeated study lies within the predicted replication interval (or outside it). What happens in practice is that if the second experiment does replicate the result of the first within the interval, then a DECISION is often made to build upon it by doing more work, because the probability of subsequent replication is then even higher. However, there is no guarantee that the resulting line of research will lead to a successful outcome because, as you say, an unlikely chance or unwitting bias may have produced perverse results in either the first or second study, or both. Still, the estimated probability of subsequent replication may be regarded as an estimate of the probability that things will go well in future (or not, as 1−p).

You have a real example of an application?

In response to your comment, John, of the 9th of June: every scientist and critical reader I have known reads papers by at least intuitively considering the probability of replication (and plans studies carefully and writes papers with this in mind). Generally speaking, most methods sections read well and are taken on trust, so that the subjective probability of non-replication due to poor methods is low or zero; provided ‘P’ is very low, I would then regard the probability of replication as adequately high. However, suspicion of dishonesty would severely lower the probability of replication. Such suspicion is of course very toxic because it also lowers the probability of replication (i.e. the credibility) of other work linked to the author.

Professor Mayo’s question of the 8th June about Professor Förster’s work provides you with a perfect example. The ‘P’ values and thus the p(RI/data alone) for his work were fine but the pattern of results aroused suspicion of fabrication and dishonesty. When asked to show the raw data records, he claimed that they had been destroyed because he had moved to a smaller office. This was regarded as being scientifically improper. Because of this destruction of records, he could not be found guilty or innocent of fraud, this more serious accusation being ‘unproven’. The relevant papers were retracted because of suspicion (not proof) of fabrication (that lowered the probability of replication). Because he was not found guilty of fraud, he was able to move to work in another university. Presumably, he will keep careful records, not lose them, and be prepared to show them on demand to readers in future.

Gelman recently had a very short post and comment, see here: http://andrewgelman.com/2015/06/04/a-quick-one/

“Fabio Rojas asks:

Should I do Bonferroni adjustments? Pros? Cons? Do you have a blog post on this? Most social scientists don’t seem to be aware of this issue.

My [Gelman] short answer is that if you’re fitting multilevel models, I don’t think you need multiple comparisons adjustments;”

and then in the comments Gelman:

“You’re not listening! The correct answer to “what should we multiply our p-value by,” is: Don’t summarize your inferences with p-values! Fit a multilevel model and these problems go away. No pre-specifying and reporting required.”
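For readers unfamiliar with the adjustment Rojas asks about: the classical Bonferroni correction simply divides the significance threshold by the number of tests performed. A minimal illustration (my own sketch, not anything from Gelman’s post):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Classical Bonferroni adjustment: with m tests, reject the i-th null
    hypothesis only when p_i < alpha / m, which bounds the family-wise
    error rate by alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# With three tests the per-test threshold shrinks to 0.05/3 ≈ 0.0167:
bonferroni_reject([0.01, 0.04, 0.20])   # → [True, False, False]
```

This is exactly the kind of p-value bookkeeping Gelman is arguing against: in his view, partial pooling in a multilevel model handles the multiplicity without pre-specified adjustments.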

Since Gelman has come out as an “Error Statistician” I was curious whether you supported his viewpoint on multiple testing.

I don’t really understand his view on multiple testing; the multilevel modeling seems to change the problem. Anyway, it is only one example of a data-dependent selection.