Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results

I generally find National Academy of Science (NAS) manifestos highly informative. I only gave a quick reading to around 3/4 of this one. I thank Hilda Bastian for twittering the link. Before giving my impressions, I’m interested to hear what readers think, whenever you get around to having a look. Here’s from the intro*:

Questions about the reproducibility of scientific research have been raised in numerous settings and have gained visibility through several high-profile journal and popular press articles. Quantitative issues contributing to reproducibility challenges have been considered (including improper data management and analysis, inadequate statistical expertise, and incomplete data, among others), but there is no clear consensus on how best to approach or to minimize these problems…

A lack of reproducibility of scientific results has created some distrust in scientific findings among the general public, scientists, funding agencies, and industries. For example, the pharmaceutical and biotechnology industries depend on the validity of published findings from academic investigators prior to initiating programs to develop new diagnostic and therapeutic agents that benefit cancer patients. But that validity has come into question recently as investigators from companies have noted poor reproducibility of published results from academic laboratories, which limits the ability to transfer findings from the laboratory to the clinic (Mobley et al., 2013).

While studies fail for a variety of reasons, many factors contribute to the lack of perfect reproducibility, including insufficient training in experimental design, misaligned incentives for publication and the implications for university tenure, intentional manipulation, poor data management and analysis, and inadequate instances of statistical inference. The workshop summarized in this report was designed not to address the social and experimental challenges but instead to focus on the latter issues of improper data management and analysis, inadequate statistical expertise, incomplete data, and difficulties applying sound statistical inference to the available data.

As part of its core support of the Committee on Applied and Theoretical Statistics (CATS), the National Science Foundation (NSF) Division of Mathematical Sciences requested that CATS hold a workshop on a topic of particular importance to the mathematical and statistical community. CATS selected the topic of statistical challenges in assessing and fostering the reproducibility of scientific results.

WORKSHOP OVERVIEW

On February 26-27, 2015, the National Academies of Sciences, Engineering, and Medicine convened a workshop of experts from diverse communities to examine this topic. Many efforts have emerged over recent years to draw attention to and improve reproducibility of scientific work. This workshop uniquely focused on the statistical perspective of three issues: the extent of reproducibility, the causes of reproducibility failures, and the potential remedies for these failures. …

The workshop, sponsored by NSF, was held at the National Academy of Sciences building in Washington, D.C. Approximately 75 people, including speakers, members of the planning committee and CATS, invited guests, and members of the public, participated in the 2-day workshop. The workshop was also webcast live to nearly 300 online participants. This report has been prepared by the workshop rapporteur as a factual summary of what occurred at the workshop. The planning committee’s role was limited to organizing and convening the workshop. The views contained in the report are those of individual workshop participants and do not necessarily represent the views of all workshop participants, the planning committee, or the National Academies of Sciences, Engineering, and Medicine.

In addition to the summary provided here, materials related to the workshop can be found online at the website of the Board on Mathematical Sciences and Their Applications (http://www.nas.edu/bmsa), including the agenda, speaker presentations, archived webcasts of the presentations and discussions, and other background materials.

You can read the full report here.

Share your thoughts.

*By the way, my favorite quote from Fisher (vs isolated significant results) was stated in full by two of the speakers (let me know if you find more than 2).

Misc. remarks.

First, I’d like it if Stephen Senn would look at how one of the contributors computes replication probabilities on p.49, and let us know what he thinks.

I noticed a reference on p.50 to Young and Karr (2011)–Stan Young being a contributor to this blog.

p. 77: “There are many instances in which analyses are done prematurely against the advice of statisticians, and researchers shift outcomes or fail to define outcomes adequately at the onset so as to look for an outcome that produces a significant result. The participant noted that it is hard to resist the pressure because statisticians usually work for the investigator and an investigator can look for other statisticians whose recommended adjustments are less burdensome.”

This reminds me of the ratings agencies in the movie The Big Short–each was under pressure to puff up the ratings of junk, else junk owners would just go to other, more lenient ratings agencies.

p.79 Baggerly gives the best idea I’ve seen so far in this report: the Popperian requirement that researchers stipulate in advance “what results would indicate that the treatment resulted in no significant differences”.

Compare the topics/treatments of this workshop with my recent post on Goldacre’s complaints: people/journals saying there’s nothing wrong with what they’re doing.

https://errorstatistics.com/2016/02/03/philosophy-laden-meta-statistics-is-the-new-technical-activism-free-of-statistical-philosophy/

>”The Popperian requirement that researchers stipulate in advance ‘what results would indicate that the treatment resulted in no significant differences’.”

The Popperian requirement would be to stipulate in advance what results would indicate the evidence is inconsistent with the researcher’s theory. This is != “significant differences”, which are usually the opposite of what is predicted by the theory. That is also what the researchers and audience actually care about.

The lack of even a genuine effect would block the inference entailed by the theory. The presumption is that the theory entails a genuine statistical effect. In a context where the concern is that it’s too easy to find a way to interpret the data as evidence of a genuine effect, this kind of prespecification is something.

What I mean is that if the researcher’s theory entails mean1 != mean2 (or more often mean1>mean2) and that is what we observe, this would be confirmation not falsification. Wasn’t it Popper’s position that this is not valid reasoning?

I’d think he would want the researcher to severely test the theory they believe, rather than a different theory they hope is false (the one predicting no genuine effect).

Yes, affirming the consequent is deductively invalid, even though e gives a boost in probability to H when H entails e. I’ll just repeat what I said before, at least if a researcher says in advance what would count as failing to show an effect, he or she cannot then take that negative result and interpret it in such a way that it has shown an effect. So I thought that was a sensible suggestion. Having done so (specified up front what will be accepted as failure to even find a genuine effect), and finding an effect entailed by a theory, surely does not warrant inferring the theory. It’s just better than allowing total flexibility post-data.

If you look up Popper on this blog, you can see where I think some standard views about him are incorrect.

Anon: Looking at your remark again, it should be clarified that what’s being prespecified is what would NOT count as being in accord with a theory or hypothesis. Those are results that wouldn’t even permit rejecting the null. Admittedly, rejecting the null doesn’t warrant affirming the theory, but this serves as a clear check that not just any outcome is going to be interpreted as a win for the pet theory. I hope that helps.

Anoneuoid:

If the researcher’s theory entails ( mean1 > mean2 ) and the researcher’s theory is false, what then is the state of affairs?

Can we agree that falsifying the researcher’s theory leaves us with the conclusion that ( mean1 > mean2 ) is false?

In order for a researcher to severely test the theory he/she believes, do we not need some sense of the state of affairs should the researcher’s theory of belief be incorrect?

This will likely be a very unpopular comment.

Look at all the things coming from industry. Cell phones. Cars. Food. Etc. The list is essentially endless. Mostly, all those things work well. We consumers vote with dollars and industry provides things we like and need. Industry knows that if the customer is not satisfied, they are out of business. Think Darwin.

Now think of a university research effort. The product is a paper. The validity of a claim in a paper is entirely secondary. There is essentially no oversight within the university. We really cannot depend on peer review. The peers do not ask for the data set, for example. The authors do not make their data sets public (that is changing). If a claim in a paper is wrong, there are few if any consequences for the authors. Mostly, I think, poor reproducibility is coming from universities, from observational and, amazingly, experimental studies.

And if a problem is identified, most journals have no real way to deal with it. Editors do not like to be shown to have made a bad decision. Authors almost always resist. Readers have more important things to do than fight through the criticism process. If a paper is demonstrably wrong, it is almost never withdrawn. See the recent paper in Nature by David Allison on how to deal with errors.

Science, we have a problem.

PS: Feinstein, Science, 1988, said researchers should state their hypothesis before the research is started. It did not happen. Nothing has changed in applied areas of science at universities. If science is not honored in industry, the company fails.

PPS: RCTs are largely ok. University stat people mostly are not dealing with applied research claims. The papers where claims are failing use a lot of statistics. We statisticians run a risk that we will be tarred with the same brush as researchers doing whatever they can get away with.

Stan: Great to hear from you. For readers, I’m linking to a few earlier posts by Stan Young at the bottom of this comment.

Your point seems to be that penalties for sloppy research and lack of reproducibility are largely absent in academia and academic journals, and if they could somehow be more like industry, then they too would straighten up and fly right? Is that it? But surely medical discoveries have payoffs and penalties. Granted, the Potti case set a horrible and unforgivable example of letting blatantly bad statistical practices go largely unpunished. So maybe we need harsher sticks compared to the shiny carrots.

So please explain more about how you think we can fix or improve things.

I’m going to give my impressions shortly.

SOME YOUNG POSTS:

-When Young spoke to a seminar I ran with Aris Spanos: Reliability and reproducibility: fraudulent p-values through multiple testing

https://errorstatistics.com/2014/04/26/reliability-and-reproducibility-fraudulent-p-values-through-multiple-testing-and-other-biases-s-stanley-young-phil-6334-day13/

-S. Stanley Young: more trouble with trouble in the lab

https://errorstatistics.com/2013/11/16/s-stanley-young-more-trouble-with-trouble-in-the-lab-s-stanley-young-guest-post/

-Scientific integrity and transparency

https://errorstatistics.com/2013/03/11/s-stanley-young-scientific-integrity-and-transparency/

Making products that work is not the same as understanding why things work. My experience of academia vs industry is, unsurprisingly, that industry is better at the former and academia at the latter. Half the reason academia is/appears to be getting worse is treating the former like the latter, hence ‘papers as the product’. So if anything we need to push back against industry-fying academia, IMO. The standard of computational and statistical modeling can be shocking in industry.

Let me add the following on multiple testing/p-hacking:

There is nothing inherently wrong with a p-value. It is a translation of a signal-to-noise ratio to a zero-to-one scale. Who could object to trying to understand signal in the presence of noise? A researcher might legitimately have many questions from a data set. It is fine to examine multiple questions. In reporting the results of research, the researcher should inform the reader what was done, including the examination of multiple questions.

Now let’s consider point of view. Mostly we are consumers of science. As a consumer I want to be able to evaluate claims made in a research paper. I need the protocol, if there was one. I need the analysis data set, or it has to be available to a trusted 3rd party. I need the analysis code. Having protocol, data, and analysis code is the essence of “reproducible research”.

I would like the researcher to report all p-values computed on questions examined. I can post-process the p-values using a p-value plot or some multiple testing adjustment. In fact, the researcher should do some sort of global evaluation of multiple p-values to save me effort. The researcher can report raw and adjusted p-values and let the consumer into the evaluation process. The researcher can explore in a random sample of the data and confirm or not in the remaining data.
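The kind of post-processing described here (globally adjusting a full set of reported p-values for multiple testing) can be sketched with the Benjamini–Hochberg step-up adjustment. The function below is an illustrative sketch, not code from the report or the comment:

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment.

    Returns FDR-adjusted p-values in the original order. Each raw
    p-value is scaled by m/rank (rank among the sorted p-values),
    then monotonicity is enforced from the largest rank downward.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        val = min(prev, pvals[i] * m / rank)
        adjusted[i] = val
        prev = val
    return adjusted

# Four p-values from four questions examined in the same data set:
print([round(p, 4) for p in benjamini_hochberg([0.001, 0.01, 0.04, 0.2])])
```

The smallest p-value is penalized the most (multiplied by m/1), which is exactly the consumer-side correction for “many questions from a data set” that the comment asks researchers to report.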

I am suspicious of any claim made from an observational study if there is no protocol, no access to the analysis data set, no access to the analysis code and now no reporting of p-values for questions examined.

One of the most crippling issues in all the misunderstandings of statistical findings is not specifying differences of scientific/biological/medical relevance before analyzing confirmatory data.

Marcia McNutt’s recommendation for discussion by authors of “Power analysis for how many samples are required to resolve the identified effect” should read “Power analysis for how many samples are required to resolve the pre-specified effect of scientific relevance”.

It’s not always easy to identify a difference or other measure of effect size that means something relevant in the context of the problem under investigation, but it is essential to do in order to conduct reasonable statistical evaluation.

Investigators must be quizzed by statisticians so as to identify what kinds of differences mean something. Such differences must be pre-specified for any serious randomized clinical trial or other scientific evaluation, and tests for such pre-specified differences must be included in the final analysis. This is the issue that Ben Goldacre is rightly demanding in his Nature paper discussion, and that Mayo codifies in severity philosophy.

With such identified differences of scientific relevance and some previous exploratory data, one can then begin to assess how many samples will be required in future studies to ensure that differences of scientific relevance will result in a rejection of the null hypothesis with high frequency in confirmatory studies.
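The assessment described above can be sketched with the standard normal-approximation sample-size formula for a two-arm comparison, where `delta` is the pre-specified difference of scientific relevance. The function name and its defaults are illustrative, not from the comment:

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-sided, two-sample z-test
    to detect a pre-specified, scientifically relevant difference `delta`,
    given outcome standard deviation `sigma` (e.g. from exploratory data).
    """
    z = NormalDist().inv_cdf
    n = 2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2
    return math.ceil(n)

# To detect a relevant difference of 0.5 SD with 80% power at alpha = 0.05:
print(n_per_group(delta=0.5, sigma=1.0))  # 63 per arm
```

Note that the input driving everything is `delta`: if no difference of scientific relevance has been pre-specified, there is nothing meaningful to power the study against.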

Insisting on this all too often forgotten essential component of a reasonable statistical evaluation will do much to improve the current debacle yielding mostly false findings in the scientific literature. Forcing scientists to identify meaningful degrees of difference will help clarify research goals. Small p-values alone are not meaningful if there is no discussion of pre-specified relevant effect sizes.

Well, I guess it is unreasonable to expect to convince statisticians that rejecting a default nil null hypothesis is not scientifically interesting, or to convince researchers relying on this procedure that a p-value is not a metric of the evidence for/against their research hypothesis. It seems that combination of insights cannot easily be learned without rather extensive personal experience on both the data collection and analysis sides.

Instead, here is a comment by Fisher on replication that I find more interesting than the one usually quoted:

“The confidence to be placed in a result depends not only on the magnitude of the mean value obtained, but equally on the agreement between parallel experiments. Thus, if in an agricultural experiment a first trial shows an apparent advantage of 8 bushels to the acre, and a duplicate experiment shows an advantage of 9 bushels, we have n = 1, t = 17, and the results would justify some confidence that a real effect had been observed; but if the second experiment had shown an apparent advantage of 18 bushels, although the mean is now higher, we should place not more but less confidence in the conclusion that the treatment was beneficial, for t has fallen to 2.6, a value which for n = 1 is often exceeded by chance. The apparent paradox may be explained by pointing out that the difference of 10 bushels between the experiments indicates the existence of uncontrolled circumstances so influential that in both cases the apparent benefit may be due to chance, whereas in the former case the relatively close agreement of the results suggests that the uncontrolled factors are not so very influential. Much of the advantage of further replication lies in the fact that with duplicates our estimate of the importance of the uncontrolled factors is so extremely hazardous.”

Fisher, R. A. (1934). Statistical Methods for Research Workers (5th ed.). London: Oliver and Boyd, pp. 123-124. http://www.haghish.com/resources/materials/Statistical_Methods_for_Research_Workers.pdf
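Fisher’s numbers can be checked directly: treating the two parallel trials as replicate observations of the treatment advantage, the one-sample t statistic (testing mean 0) has n = 1 degree of freedom. A minimal check; the helper function is mine, not Fisher’s:

```python
import math

def t_stat(replicates):
    # One-sample t statistic testing mean = 0, computed from n replicate
    # trial results; degrees of freedom = n - 1 (Fisher's "n").
    n = len(replicates)
    mean = sum(replicates) / n
    var = sum((x - mean) ** 2 for x in replicates) / (n - 1)
    return mean / math.sqrt(var / n)

print(round(t_stat([8, 9]), 1))   # 17.0: close agreement, high confidence
print(round(t_stat([8, 18]), 1))  # 2.6: higher mean, yet less confidence
```

The second pair has the larger mean advantage but the far smaller t, reproducing the “apparent paradox” in the quote.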

Anoneuoid:

Fisher’s work is totally infused with assessing and fostering the reproducibility of scientific results–aka meta-analysis, or analysis of repeated studies–but this seems to have been largely ignored.

It “was fairly central in Fisher’s work. This is apparently somewhat surprising even to some well known scholars of Fisher [private communication AWF Edwards] and this insight may aid those who try to understand Fisher’s work.”

I collected bits of these in my thesis https://phaneron0.files.wordpress.com/2015/08/thesisreprint.pdf but missed the entry you identified.

Keith O’Rourke

Anon: Thanks so much for the link. Many people dismiss his agriculture discussions as too agricultural* but the logic is very general.

*Gigerenzer says they had the air of manure about them.

Anoneuoid:

Yes, I noticed that the rejection of the null hypothesis test for LIGO detected signals for gravitational waves

“This corresponds to a probability < 2 × 10^-6 of observing one or more noise events as strong as GW150914 during the analysis time, equivalent to 4.6σ.”

was indeed scientifically uninteresting, and hardly noticed by anyone, this statistician excluded. As you declare, such tests are not interesting scientifically, and this metric indeed provides no evidence in favor/against their research hypothesis. Apparently, gravitational waves await further valid statistical assessment. Do let us know when an analysis of which you approve has been published.

Unnoticed reference:

B. P. Abbott et al. (LIGO Scientific Collaboration and Virgo Collaboration), “Observation of Gravitational Waves from a Binary Black Hole Merger,” Phys. Rev. Lett. 116, 061102 (2016). Received 21 January 2016; published 11 February 2016.

Steven: Thanks much for this citation; I was planning to look some up in relation to gravitational waves.

Steven McKinney:

I read the paper when it came out and was interested in why they wrote so much about the p-value. I got an opportunity to ask a member of the LIGO collaboration about it on Gelman’s blog.[1] I found at least one person involved agreed it was a poor decision to use that statistic. According to him, it was only used because that is standard operating procedure (ie everyone does it), not because of any real merit or usefulness.

I believe that discovery was based on the ability to fit a precise theoretical model to the signal, and indeed the p-value is not very interesting. In fact it is misleading, because the prior expected rate of BH-BH mergers was also rather low. The probability the signal was due to chance may be multiple orders of magnitude higher than the p-value if we assume those are the only two possible explanations (the other important aspect is that LIGO claims to have ruled out every other conceivable explanation).

[1] http://andrewgelman.com/2016/02/15/the-recent-black-hole-ligo-experiment-used-pystan/
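The point above (that with a low prior rate of real events, the probability a signal is mere noise can be orders of magnitude above the p-value) can be illustrated with a toy two-hypothesis Bayes calculation. All numbers are illustrative, and treating the p-value as P(data | noise) is a simplifying assumption of mine, not part of the LIGO analysis:

```python
def prob_due_to_chance(p_value, prior_real, p_obs_given_real=1.0):
    """Toy posterior probability that an observed signal is noise,
    assuming noise and a real event are the only two explanations.
    Uses the p-value as a stand-in for P(observation | noise), a
    deliberate simplification for illustration only.
    """
    prior_noise = 1.0 - prior_real
    num = p_value * prior_noise
    return num / (num + p_obs_given_real * prior_real)

# p = 2e-6 but only a 1-in-1000 prior for a real merger signal:
print(prob_due_to_chance(2e-6, prior_real=1e-3))  # ~2e-3, ~1000x the p-value
```

With these made-up numbers the posterior probability of “just noise” is about three orders of magnitude larger than the p-value, which is the shape of the commenter’s claim.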

Anoneuoid:

From the Gelman blog posting referenced:

“I personally cannot stand FAP as conceived as a detection threshold statistic for a variety of reasons but it seems to work . . . No doubt there are probably better ways to do this but the SOP currently in the field is what you see in the detection paper. The supplementary papers are filled with data quality consistency checks and Bayes parameter estimation runs, so don’t think this discovery rests solely on a single statistic.”

It’s always great to hear that a discovery does not rest solely on a single statistic. We are indeed lucky to be far enough into the modern age of statistics that multiple methods are available to assess such phenomena as gravitational waves and Higgs bosons.

I will not denounce the Bayesian methods employed – I remain unclear on why people find it so fashionable to denounce error statistical methods. Personal opinions about particular statistical methods should not be confused with philosophical, mathematical and scientific evidence as to their effectiveness.

Curiously, the commenter who cannot stand FAP at least recognizes that “it seems to work”.

I’ll also note that many Bayesian methods “seem to work”. Some of them are apparently becoming SOP in the field.

It’s astounding what we are learning about our universe with all of these inference methods that seem to work.

> “I remain unclear on why people find it so fashionable to denounce error statistical methods.”

I don’t. Use them to test a real hypothesis, not a default nil null hypothesis, and that is probably fine. The p-value has other uses as a summary statistic (eg as recommended by Michael Lew), or as a computationally cheap way to rank “signals” in order to focus on those that may be most interesting (what the commenter was referring to regarding LIGO). Don’t confuse these uses for p-values with NHST (hereafter used to refer to NHST with the default nil null hypothesis).

>”It’s astounding what we are learning about our universe with all of these inference methods that seem to work.”

First, many claims of “learning” something are based on NHST to begin with. You can’t use NHST to prove NHST works. So without knowing specifically what you refer to, I would probably disagree. My background is in medicine. In that area, it is becoming clearer and clearer (to people in general; I have known this for a while now) that the vast majority of what has been claimed over the last half century will have to be reassessed by future generations. The simple fact is that “chance” is the least of our concerns; there are always multiple mundane reasons for a deviation from the “null hypothesis”. Any study designed in a way that is capable of distinguishing between two real explanations for such a deviation will also be able to rule out chance. Testing the hypothesis of no difference is redundant in the case of properly designed research.

Secondly, in other cases it is just irrelevant. People put it in their papers because everyone else does it. I know this, because I can read papers and completely ignore the asides about “p=.0123” that litter them without losing anything. A funny thing is that even if I believed rejecting the null hypothesis was interesting, I would still need to read papers in this manner due to the rampant “misuses” (p-hacking, etc).

Thirdly, there are rare times where the default hypothesis happens to correspond to a plausible hypothesis it is worth ruling out. For example in studying ESP, where we really do expect zero difference between groups of people in a properly designed experiment. However, it seems people don’t accept NHST-based claims in that case anyway.

Stephen:

>It’s astounding what we are learning about our universe with all of these inference methods that seem to work.

May seem that way – but there is no alternative – methods seem to work and we continually re-evaluate them by trying harder to assess if they do seem to work (Ramsay put it roughly as induction can only be evaluated by induction.)

All in all, we can only struggle and hope to get less wrong as we continue to inquire.

Phan: Don’t forget that we do often arrive at substantive theories and causal understanding. We’d better understand why a method works, and when it doesn’t. If researchers are engaged in a full research program, results build and triangulate, and researchers get better at checking themselves in handling a domain. They should publish explanations of the wrong way to go about studying something, and why. My puzzlement at the NAS report, and similar ones, is the suggestion that researchers are just jumping in and out of isolated studies, publishing a paper and moving to a distinct area. I would be surprised to find that’s how researchers really function.

Maybe some of these studies should focus on success stories. How knowledge results from immersion in a research problem over time. So this is one of my reactions to the report that I said I’d withhold until hearing from readers.

> My puzzlement at the NAS report, and similar ones, is the suggestion that researchers are just jumping in and out of isolated studies, publishing a paper and moving to a distinct area.

I believe, more often than not, that is the case in clinical research (what I found _original_ in Gelman’s work was that he kept revisiting the same topic/question with newer methods and data – I never got to do that.)

Keith O’Rourke

phaneron0:

You are right, of course there is no alternative.

All we can do as a large tribe of babbling monkeys on a large ball of dirt hurtling through spacetime is occasionally come to a consensus about what is real and what isn’t.

So indeed all we can do is identify inference methods that seem to work, and continually re-evaluate them. That is the great point of this philosophical blog, and the other excellent work done by philosophers such as Mayo. We have to keep banging away at the methods that we feel are useful tools in understanding truth.

How do we know anything? That’s the whole point of human philosophical discussions. Mayo’s philosophy makes sense to me, that’s all any of us can ever say.

What doesn’t make sense to me is knee-jerk rejection of a body of mathematical tools with a reasonable philosophical backing, without any sound argument supporting such rejection.

Anoneuoid posits on frequentist hypothesis testing of a null versus an alternative: “it is just irrelevant. People put it in their papers because everyone else does it.” No reason as to why it is irrelevant, apparently Anoneuoid’s brief opinion is good enough. I have to wonder how much Anoneuoid will appreciate Bayesian-based methods such as the “Stan”-based analysis everyone is so excited about on the Gelman blog posting, once everyone does that. Do people just dislike a methodology because “everyone else does it”?

Anoneuoid further posits that “Thirdly, there are rare times where the default hypothesis happens to correspond to a plausible hypothesis it is worth ruling out. For example in studying ESP, where we really do expect zero difference between groups of people in a properly designed experiment. However, it seems people don’t accept NHST-based claims in that case anyway.”

Okay – so even if the times are rare, why would that render the methodology “scientifically uninteresting” as originally stated by Anoneuoid? I’m not seeing a consistent case here for or against the methodology, just a lot of hand waving.

Now as to why “people don’t accept NHST-based claims in that case anyway”, I am confused. I certainly accept the NHST-based claims that, to me, clearly and repeatedly show the implausibility of ESP claims. Other people, who want to believe ESP ideas, of course will not accept NHST-based claims in that case. I will venture that such people will not accept any Bayesian-based claims either. I encourage Anoneuoid and other exclusive Bayesians to run non-NHST-based assessments of ESP claims and publish them. When the ESP fans reject those conclusions, will Anoneuoid then look for yet another framework for assessing truth?

One thing at a time:

>”Anoneuoid posits on frequentist hypothesis testing of a null versus an alternative: “it is just irrelevant. People put it in their papers because everyone else does it.” No reason as to why it is irrelevant, apparently Anoneuoid’s brief opinion is good enough.”

This is wrong; I gave multiple reasons. 1) Personal experience that I can ignore all mention of NHST-derived p-values without any loss of information. 2) Rampant p-hacking makes such tests uninterpretable anyway. 3) Evidence it is irrelevant has been coming to light recently in the form of massive problems with reproducing results in every field that relies upon NHST.

Also, I don’t consider the NHST method I am talking about to be representative of Frequentist statistics. My problem is with the choice of null hypothesis. Choose something deduced from the theory motivating the research to be “the hypothesis to be nullified” and it would be a totally different story. Bayes factors using the strawman null hypothesis are just as worthless. You are the one who keeps bringing up Frequentists vs Bayesians, not me. I consider that a big red herring.

Anoneuoid:

My apologies, I did indeed misinterpret your stance behind the NHST issue. I stand happily corrected, understanding your position as explained here. Thanks for taking the time to explain.

Rampant p-hacking indeed yields uninterpretable results, but this does not make NHST irrelevant and uninformative – it is the p-hacking technique that is irrelevant and uninformative. If a bunch of patients show ill effect after undergoing plastic surgery, this does not mean that all surgery is useless, but rather that surgical techniques should be properly applied in appropriate situations. The current rash of heroin addiction in the USA does not mean that opioids are useless, but rather that they should be dispensed properly, in moderation to those with most need, not willy-nilly to boost sales of Oxycontin while it is still under patent and highly profitable. P-hacking is a misuse of a useful methodology. To stop p-hacking by banning NHST is essentially what the misguided editor David Trafimow is trying for the journal Basic and Applied Social Psychology. That really is throwing out the baby with the bathwater.

Massive problems with reproducing results have a number of causes: p-hacking, journal editors rejecting studies with null findings, errors in spreadsheets, and so on. The scientific arena clearly has some work to do to reduce the rate of irreproducible results. This will happen with proper application of statistical methodologies, proper use of computer programs, and better judgment by journal editors, rather than by banning NHST, spreadsheets, and journals. That’s what the NAS publication that prompted Mayo to set up this blog topic addresses.

When I read a scientific paper, I expect every assertion to have a reasonable statistical analysis backing up the assertion. I look for evidence that the authors have performed sound statistical analysis, NHST or otherwise. Ignoring properly done tests certainly results in loss of information for me – where then is the sound statistical evidence backing up the assertion? Evidence of p-hacking by authors makes the whole paper uninformative – there’s no information to be lost in such cases, as there is no information to begin with.

If my hypothesis to be nullified is Ho: (mean1 greater than mean2), then I see the nullification of that hypothesis as Ha: (mean1 less than or equal to mean2). Is this what you see it as? Something else? (One reason Ho: (mean1 equals mean2) was adopted historically is that the mathematics under such a condition was simpler, with central rather than non-central chi-square distributions for example, which was important in the era preceding the cheap availability of electronic computers.)
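For concreteness, here is a minimal sketch (mine, not the commenter's) of how such a one-sided comparison is usually run in software; standard practice places the equality in the null, testing Ho: mean1 >= mean2 against Ha: mean1 < mean2, and the data below are invented:

```python
# Hypothetical one-sided two-sample test: H0: mu1 >= mu2 vs Ha: mu1 < mu2.
# The data are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(loc=4.8, scale=1.0, size=40)  # true mean 4.8
group2 = rng.normal(loc=5.5, scale=1.0, size=40)  # true mean 5.5

# alternative='less' tests whether mean(group1) is less than mean(group2)
t_stat, p_value = stats.ttest_ind(group1, group2, alternative='less')
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")
```

A small p-value here counts as evidence against the composite null mean1 >= mean2; the chi-square remark above concerns the historical convenience of the point null, not this one-sided setup.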

Feinstein, in a 1988 Science article, pointed out that on 56 or so contested questions in epidemiology, the papers for and against each question were about equally divided. I took the trouble to count it out: roughly 2.4 versus 2.6 papers per question. He noted that the question at issue was almost always decided after looking at the data (and p-values). In effect, he called out p-hacking. Did epidemiologists learn? You bet they did. About two years later the journal Epidemiology was started, and one of the lead articles took the position that there was no need to correct for multiple testing. Gresham's Law came to epidemiology. The rest is history.

There is little or no real problem with the basic tests themselves: the t-test, the chi-square test, and so on. As you note, it is the strategy of their use that is the big problem. Thought leaders, people who know better, are willfully using an analysis strategy that essentially guarantees the claims will not replicate, but that papers will be published.

Stan: Wait a minute, you’re saying that Epidemiology (was that the Rothman journal?) deliberately decided to allow data-dependent choices knowing it would ensure invalid p-values? Wasn’t he the man who banned p-values from the journal? Excuse me if I’ve got it wrong. I’ll look for the Feinstein link.

Rothman started the journal Epidemiology in 1990, and in a lead article he took the position that no adjustments for multiple testing were needed.

Stan: OK and you’re saying he did that deliberately in order to make it easier to p-hack? But I thought he banned p-values from the journal, or was that later?

>”If my hypothesis to be nullified is Ho: ( mean1 greater than mean2 ) then I see the nullification of that hypothesis as Ha: ( mean1 less than or equal to mean2 ) Is this what you see it as?”

This paper pretty much establishes my position regarding how to choose a null hypothesis:

Specifically, for NHST to be useful, the hypothesis to be tested needs to correspond to a precise prediction deduced from some real theory people are interested in. What are your comments on the arguments put forward in that paper?

Anoneuoid:

Thanks for the reference; it is an interesting read by a well-regarded scholar from back in the day.

The article was published in 1967, almost 50 years ago.

Many changes in statistical analytical paradigms have occurred since this article was written, some no doubt in response to it. So I would encourage you to read more modern treatments of statistical hypothesis-testing methodology.

For example, Meehl discusses an example testing for differences in colour-naming between two groups of school children. With 55,000 children in the data set, trivially small differences show a p-value of less than 0.000001.

This is a known issue, and it is precisely why, in the ensuing 50 years, competent statistical analysts have learned to identify differences of scientific, medical, or biological importance before testing. If the observed difference is smaller than such a scientifically relevant difference, the size of the p-value is irrelevant.
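Meehl's point is easy to reproduce by simulation (a sketch with invented numbers, not his data): with n in the tens of thousands, a trivially small true difference yields a tiny p-value, which is exactly why the observed effect should be checked against a pre-specified relevant difference:

```python
# Illustration (simulated data, not Meehl's): huge n makes trivial effects "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 55_000  # roughly the size of Meehl's colour-naming data set
boys = rng.normal(loc=100.0, scale=15.0, size=n)
girls = rng.normal(loc=100.5, scale=15.0, size=n)  # trivially higher true mean

t_stat, p_value = stats.ttest_ind(boys, girls)
observed_diff = girls.mean() - boys.mean()

relevant_diff = 5.0  # pre-specified difference of scientific importance (assumed)
print(f"p = {p_value:.2e}, observed diff = {observed_diff:.2f}")
print("practically relevant?", abs(observed_diff) >= relevant_diff)
```

The p-value comes out minuscule while the observed difference is a small fraction of the assumed relevant difference, so the test "detects" something of no practical importance.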

This point is discussed in the NAS report by Keith Baggerly: “He cautioned that it is important not to focus solely on p-values for large data sets, because those values tend to be small. Rather, he suggested that the effect size also be quantified to see if it is big enough to be of practical relevance.” (page 79 of the NAS document)

This point is also discussed in the ASA Statement on Statistical Significance and P-values: “Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect.” (Page 10, Item 5 of their statement.) Mayo was part of this group, and discusses the document in a more recent blog post here.

Much progress has been made in the last 50 years in specifying reasonable NHST methods that no longer yield the problems aptly discussed by Paul Meehl back in 1967.

Anoneuoid:

“Specifically, for NHST to be useful, the hypothesis to be tested needs to correspond to a precise prediction deduced from some real theory people are interested in.”

Beautifully said.

This phrasing captures the concept that a difference of scientific relevance needs to be specified and the test properly powered so that a definitive conclusion can be reached after the data acquired are assessed with due statistical rigour. Mayo discusses this concept with a difference of relevance denoted by gamma in her discussion of point 5 of the ASA statement in her current blog post “Don’t throw out the error control baby with the bad statistics bathwater”.

I agree that the size of the effect that matters is very situational. We worked on an additive to chicken feed; we used millions of birds in the experiment and detected very small, but real, effects. Those funded by the EPA multiply their very, very small effects by the US population to claim thousands of deaths. These are observational studies, so any small bias will lead to statistical significance. If the effect is positive, publish; if negative, file drawer.
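The arithmetic behind such population-scaled claims is worth making explicit (all numbers below are invented for illustration):

```python
# Hypothetical illustration of scaling a tiny risk difference to a population.
excess_risk = 1e-5           # assumed excess deaths per person per year (tiny)
us_population = 320_000_000  # rough US population

implied_deaths = excess_risk * us_population
print(f"implied annual deaths: {implied_deaths:.0f}")  # 3200

# The same arithmetic amplifies any small systematic bias in an
# observational study by the same factor of 320 million.
bias = 0.5e-5
print(f"deaths attributable to bias alone: {bias * us_population:.0f}")  # 1600
```

An undetectably small bias of the same order as the claimed effect produces a comparably large headline number, which is the commenter's point about observational studies.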

Point me to the references you cite.

Thanks.

Deborah: I did insert my comment on multiple testing/p-hacking. Stan

Steven: If you care enough to check Gelman’s blog, you’ll see that what anon says isn’t quite correct. Bayesians were keen to show the actual non-Bayesian analysis could be reconstructed Bayesianly.

The truth is that finding things out in well-developed theoretical sciences doesn’t rely on statistical assessments in the same way as the fields in this book. Interferometers are among the most interesting and reliable sources of null results in all of science (think Michelson-Morley, the Equivalence Principle).

They get essentially perfect cancellations (the instrument can pick up a change as small as something like 1/10,000 of the width of a proton; don’t quote me).

Take a look at the bottom most video clip:

https://www.ligo.caltech.edu/page/what-is-interferometer

We know the instrument does this. We get null results “at will” and all the time, and without real statistical analysis. Yes, we infer 0 effect (within the incredibly precise bounds.)

They already knew of the existence of gravitational waves from the binary pulsar observations of the 1970s (not to mention that all non-falsified relativistic gravity theories require them, though the waves have somewhat different properties in different theories). There’s zero interest in assigning a probability to their existence. What they did here, as I understand it, is use an incredibly well-understood instrument to learn about gravitational phenomena in the domain of coalescing black holes and pulsars. Now we have gravitational wave astronomy!

Even though the full background theories are uncertain, scientists don’t need them to be true or probably true, whatever that would mean. The theory independent properties of the instruments and other background factors have been stringently tested, and hold for all viable gravity theories. The building up of knowledge is not from the top down, as a traditional Bayesian posterior updating would require, but from the bottom up (so to speak).

Any fix would seem to depend on journal editors and funding agencies. They should require two things. First, the analysis protocol should be placed on deposit before the study starts. Second, the analysis data set should be placed in a public repository. Anything less should clearly be labeled: “This is exploratory research, and its claims need to be independently replicated before use.”

Poor methods, such as testing multiple questions with no correction for multiple testing and selecting the best model from among many candidates, are often used to dredge for small p-values. It is Gresham’s Law for science: poor science drives out good science.

Within industry there is much less pressure to publish. There is a lot of experimentation, but relatively little publishing. There is a lot of oversight. The final products are things that work and that someone will pay for.

An industry randomized clinical trial has a protocol before the experiment is run. There is internal and external oversight. Launching a new drug is expensive so managers do not want to throw good money after bad.

Those more familiar with university research should comment. I suspect that university research quality is uneven.

Stan: “suspect that university research quality is uneven”

That was my experience, even between researchers who collaborated. If a study was done in teaching hospital A, double data entry had to be done and audited prior to any analysis, while in teaching hospital B things like double data entry were considered a waste of scarce research funds.

The smarter researchers would personally arrange for double data entry when they were running a study at teaching hospital B, but most did not.

Given that almost nothing is ever audited, who can tell the good from the bad?

Unfortunately, journal editors and funding agencies are unlikely to have the incentives or resources to redress deficiencies in academic process.

But even this NAS report is a sign that things will likely start to change…

Keith O’Rourke

Don’t forget that industrial research can be very adversarial. Internal budgets and promotions are based on project success or failure, which leads to a ruthless weeding out of BS-based theories. Then, when an improvement hits the market, many competitors will attack it (e.g. through ad claims), imitate it, or try to circumvent it while, simultaneously, the company tries to cost-optimize it. Finally, if customers reject it, that kicks back on the entire project team. In my experience academia has nothing like this at all.

False alarm rates in LIGO

To carry out this statistical analysis we used 16 days’ worth of stable, high quality detector strain data from the month following the event. GW150914 was indeed by far the strongest signal observed in either detector during that period. We then introduced a series of artificial time shifts between the H1 and L1 data, effectively creating a much longer data set in which we could search for apparent signals that were as strong (or stronger) than GW150914. By using only time shifts greater than 10 milliseconds (the light travel time between the detectors) we ensured that these artificial data sets contained no real signals, but only coincidences in noise. We can then see, in the very long artificial data set, how often a coincidence mimicking GW150914 would appear. This analysis gives us the false alarm rate: how often we could expect to measure such a seemingly loud event that was really just a noise fluctuation (i.e. a ‘false alarm’).

Figure 4 (adapted from figure 4 of our publication) shows the result of this statistical analysis for one of the searches carried out on our detector data. … This means that a noise event mimicking GW150914 would be exceedingly rare: indeed, we expect an event as strong as GW150914 to appear by chance only once in about 200,000 years of such data! This false alarm rate can be translated into a number of ‘sigma’ (denoted by σ), which is commonly used in statistical analysis to measure the significance of a detection claim. This search identifies GW150914 as a real event, with a significance of more than 5 sigma.
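The time-shift idea can be caricatured in a few lines of toy code (Gaussian noise, an arbitrary trigger threshold, and an arbitrary coincidence rule; nothing here resembles the real LIGO pipeline): shifting one detector's stream by more than the light travel time leaves only chance coincidences, whose rate estimates the noise background:

```python
# Toy time-slide background estimate (illustrative only; not the LIGO pipeline).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000                  # samples per detector
h1 = rng.standard_normal(n)  # detector 1 noise (whitened, toy)
l1 = rng.standard_normal(n)  # detector 2 noise (independent, toy)

threshold = 3.0              # arbitrary "loud trigger" threshold
h1_loud = np.abs(h1) > threshold
l1_loud = np.abs(l1) > threshold

# Artificial time shifts: each shifted data set can contain no real signal,
# only chance coincidences between noise triggers.
shifts = range(100, 2100, 100)  # all shifts exceed any physical travel time
coincidences = sum(int(np.sum(h1_loud & np.roll(l1_loud, s))) for s in shifts)

total_samples_analyzed = n * len(shifts)
background_rate = coincidences / total_samples_analyzed
print(f"chance coincidences: {coincidences}, rate per sample: {background_rate:.2e}")
```

The many shifted copies play the role of the "much longer data set" in the quoted description: a real event's loudness is then compared with how often noise alone produces something as loud.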

I’m currently teaching Statistical Experimentation, and I think that we don’t talk enough about the issue of generalisation. One important thing that I teach is to distinguish randomisation from random sampling (this seems to be surprisingly difficult for many).

In much experimentation, randomisation is realistic whereas random sampling is not. But this means that the scope of generalisation is unclear. Let’s say we do a proper RCT in a particular hospital or lab, using convenience sampling, at a particular time. To what population do the results generalise? On what other samples would we expect the results to be reproducible?

Obviously, we’d like to generalise at least a bit. We’d like to say that some specific circumstances of our RCT are (more or less) irrelevant to the result, and that the result should therefore reproduce in at least slightly different circumstances. It would be nice if authors of studies fleshed this out: on which circumstances, considered irrelevant, would they base claims of generalisation? What scope of reproducibility is intended?

(The Bayesians will tell you that they could have priors formalising this.)

Christian: “Let’s say we do a proper RCT in a particular hospital or lab using convenience sampling, at a particular time. To what population do results generalise? ”

Strictly speaking the conclusion refers just to the participants, unless they happen to constitute a random or representative sample of some other population. Steven will know more about clinical trials.

Psychology experiments refer to captive students (generally required to participate in some experiment), and replication checks are also, generally, on captive students.

“Strictly speaking the conclusion refers just to the participants, unless they happen to constitute a random or representative sample of some other population.”. Isn’t this the reference class problem?

Mayo: Still, replication will happen at a different time and perhaps at a different institution. So there is an implicit assumption that, while we cannot generalise to groups other than captive students, we may generalise an experiment from June 2004 to August 2006 (if that is when replication takes place).

I’m not saying that such an assumption shouldn’t be made, by the way, but rather I only want to point out that replication relies on such generalisation decisions.

John: This is some kind of “hands on”/”real study” variation on the reference class problem. I think normally when people talk about the “reference class problem”, this refers to the computation of probabilities rather than to the generalisation of study results.

Christian,

In my experience generalization is typically covered in sampling, observational studies (epidemiology & public health) and forecasting areas, under such names as

– matching

– raking

– post-stratification

– MRP (Gelman’s “Mister P”): multilevel regression and post-stratification

etc.

All of these require either a stable effect or some detailed modeling of the effects at whatever level you wish to forecast (e.g. empirical-Bayes style). I learned them after the basic experimental design courses.
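As a bare-bones illustration of post-stratification (toy numbers of my own, not from any of the references above): stratum means from an unrepresentative sample are reweighted by known population shares:

```python
# Toy post-stratification: correct a convenience sample's age imbalance
# using known population proportions (all numbers invented).
sample = {
    # stratum: (sample size, mean outcome in sample)
    "young": (800, 6.0),
    "old":   (200, 4.0),
}
population_share = {"young": 0.5, "old": 0.5}  # known census margins

# The naive sample mean ignores the imbalance (80% young in sample).
n_total = sum(n for n, _ in sample.values())
naive_mean = sum(n * m for n, m in sample.values()) / n_total

# The post-stratified estimate weights each stratum mean by its population share.
poststrat_mean = sum(population_share[s] * m for s, (n, m) in sample.items())

print(f"naive: {naive_mean:.2f}, post-stratified: {poststrat_mean:.2f}")
```

This is the "detailed modeling of effects at the level you wish to forecast" in its simplest form: the correction is only trustworthy if the stratum effects are stable across the sample and the population.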

Pearl & Bareinboim have also been doing work on making causal graphical models “transportable.”

John: I take the reference class problem refers to the appropriate class to employ in direct (frequentist) inference. Direct probabilistic inference is, say, assigning a probability to “John is a Swede”, given he’s been selected from a group with k% Swedes. If you also have information about the % of A’s that are Swedes, and that John has property A, you might use that ref class instead. How narrow? How precise? There’s a whole philosophical literature (Salmon, Levi, Kyburg and others) who discuss this. My understanding of the ref class problem might be due to my being a philosopher, and that’s how philosophers understand it. I’m not saying the term cannot be used more broadly, say to decide about relevant error probabilities and conditioning. I think it’s analogous.

I have read some of the literature, but I interpreted the reference class as being the “target” of the study, which includes the sampling strategy as well as the statistical analysis.

Christian, I’m pleased to hear that another teacher of stats focusses on scope of inferences from samples. I spend the first two hours of my nine hour course (yes, too short…) on sampling issues including the distinction between randomisation and random sampling, convenience samples and reference sets. For a non-expert user of statistics those things are more important than most of the things that we discuss here.

Take a look at this challenge to the psych replication crisis reports.

http://www.nytimes.com/2016/03/04/science/psychology-replication-reproducibility-project.html?mwrsm=Email

Note my earlier blog on this:

https://errorstatistics.com/2015/08/31/the-paradox-of-replication-and-the-vindication-of-the-p-value-but-she-can-go-deeper-i/

“Since it’s expected to have only around 50% replications as strong as the original, the cases of initial significance level < .02 don’t do too badly, judging just on numbers.

[i] Here’s a useful overview from the Report in Science:

Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.

Since it’s expected to have only around 50% replications as strong as the original, this might not seem that low. I think the entire issue of importance goes beyond rates, and that focusing on rates of replication actually distracts from what’s involved in appraising a given study or theory."
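The claim that only around 50% of replications should be as strong as the original, and fewer still when originals are selected for significance, can be checked by simulation; this is a sketch with an invented true effect size, not the OSC data:

```python
# Simulation sketch: selection for significance makes replications look weaker.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_d, n, n_studies = 0.4, 30, 5_000  # assumed true effect, per-group n

def one_study():
    a = rng.standard_normal(n)
    b = rng.standard_normal(n) + true_d
    t, p = stats.ttest_ind(b, a)
    return (b.mean() - a.mean()), p

originals = [one_study() for _ in range(n_studies)]
published = [(d, p) for d, p in originals if p < 0.05]  # selection filter

replications = [one_study() for _ in published]
weaker = np.mean([rd < od for (od, _), (rd, _) in zip(published, replications)])
rep_sig = np.mean([p < 0.05 for _, p in replications])

print(f"replications weaker than original: {weaker:.0%}")
print(f"replications significant: {rep_sig:.0%}")
```

Because publication filters on significance, published effect estimates are inflated, so well over half of the replications come out weaker even though every study here is perfectly honest. That is the baseline against which replication rates should be judged.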

This contribution refers to pages 48-49 of “Statistical Challenges in Assessing …” mentioned by Mayo in the opening contribution. I am not Stephen Senn, but here is my take. Pages 48-49 refer to Boos and Stefanski, http://amstat.tandfonline.com/toc/utas20/65/4. The following is a somewhat more detailed version of a previous contribution.

The model is the standard Gaussian model: the data are i.i.d. N(mu, sigma^2), and the question to be addressed is whether a specified value mu_0 is also consistent with the data. The mean {\bar X}_n of a sample of size n has the representation mu + sigma*Z_n/sqrt(n), where Z_n is N(0,1). Thus sqrt(n)({\bar X}_n - mu_0)/sigma = sqrt(n)(mu - mu_0)/sigma + Z_n. I consider only the one-sided situation mu_0 > mu. Given a level of significance, say 0.05, the difference mu - mu_0 will be declared significant if pnorm(sqrt(n)(mu - mu_0)/sigma) < 0.05, that is, if sqrt(n)(mu - mu_0)/sigma < -1.645.

Consider now a joint 0.95 approximation (confidence) region for (mu, sigma). For mu I take the standard one-sided interval (-infty, {\bar X}_n + 1.96*sd(X)/sqrt(n)], and for sigma the interval (0, sd(X)*sqrt(n-1)/sqrt(qchisq(0.025, n-1))]. Let mu_n and sigma_n denote the upper endpoints of these intervals. Then with probability 0.95, mu <= mu_n and sigma <= sigma_n. It follows that sqrt(n)(mu - mu_0)/sigma <= sqrt(n)(mu_n - mu_0)/sigma_n with probability 0.95. Thus if pnorm(sqrt(n)(mu_n - mu_0)/sigma_n) < 0.05, then with probability 0.95, pnorm(sqrt(n)(mu - mu_0)/sigma) < 0.05.

As an example, take the copper data I have mentioned before, with n = 27, mean 2.016, and standard deviation 0.116. The null hypothesis is H_0: mu = 2.1. The standard P-value is 0.000436. The P-value based on mu_n and sigma_n is 0.112. Is this too pessimistic? Not at all. To see this, simulate the P-values for H_0 based on the empirical values mu = 2.016 and sigma = 0.116. Based on 1000 simulations, the 0.95-quantile of the P-values was 0.0252. However, mu = 2.03 and sigma = 0.13 are also consistent with the data, and the estimated 0.95-quantile of the P-values based on these parameter values, using 1000 simulations, was 0.142.

Bootstrapping as considered in Boos and Stefanski gives an upper bound of the same order. Bootstrapping standard normal samples of size n = 27 with H_0: mu = 0.72 (corresponding to mu = 2.016, sigma = 0.116, and H_0: mu = 2.1) results in standard deviations of log(P-value) of between 2.02 and 3.97 (5%- and 95%-quantiles). Taking 1.64 times the 85%-quantile, 1.64*3.3 = 5.41, results in an upper bound of 0.000436*exp(5.41) = 0.0975, which is not too different from 0.112. A lower bound for the P-value can be obtained in the same manner. For the copper data and H_0: mu = 2.1 it is 3.84*10^-8, resulting in an interval (3.84*10^-8, 0.112) of plausible P-values for H_0: mu = 2.1. Is the lower bound too small? Again, it is not: mu = 1.99 and sigma = 0.095 are consistent with the data, and for standard normal data these values correspond to H_0: mu = 1.15. Based on 10000 simulations, the 10%-quantile of the P-values is 8.27*10^-9.
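For readers who want to reproduce the flavour of this calculation, here is a sketch in Python of the simulated P-value quantiles (seed-dependent, so the exact numbers will differ somewhat from those quoted above):

```python
# Sketch: distribution of one-sided P-values for H0: mu = 2.1 when data are
# generated from parameter values consistent with the copper data (n = 27).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, mu0 = 27, 2.1
mu_hat, sigma_hat = 2.016, 0.116  # empirical values from the copper data

def p_value(sample):
    # one-sided P-value based on the normal approximation, as in the comment
    z = np.sqrt(n) * (sample.mean() - mu0) / sample.std(ddof=1)
    return stats.norm.cdf(z)

sims = np.array([p_value(rng.normal(mu_hat, sigma_hat, n)) for _ in range(1000)])
print(f"median P-value: {np.median(sims):.2e}")
print(f"0.95-quantile of P-values: {np.quantile(sims, 0.95):.4f}")
```

The point the simulation makes concrete is that the P-value itself is a highly variable statistic: parameter values barely distinguishable from the estimates produce P-values orders of magnitude apart.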

Anonymous is me, something went wrong.

LIGO’s claim of having detected gravitational waves is quite doubtful. If you imagine a tube of 1 light-year radius and 1.4 billion light-years long, you will have about 17.6 million solar-type stars inside the tube. For a 100-200 Hz gravitational wave, the wavelength would be 1500-3000 km, which is still much smaller than the solar radius of 0.7 million km. So, statistically, there must be a dispersion effect on waves of different wavelengths: the longer-wavelength wave will travel faster than the shorter-wavelength one, and the shorter-wavelength wave will suffer more amplitude damping. The challenge, then, is this: how can LIGO get a waveform exactly matching the theoretical calculations for two merging black holes without considering any dispersion effects?