“Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)

Posted on July 17, 2015 by Mayo

Mayo, frustrated

Someone linked this to me on Twitter. I thought it was a home blog at first. Surely the U.S. Dept of Health and Human Services can give a better definition than this.

U.S. Department of Health and Human Services
Effective Health Care Program
Glossary of Terms

We know that many of the concepts used on this site can be difficult to understand. For that reason, we have provided you with a glossary to help you make sense of the terms used in Comparative Effectiveness Research. Every word that is defined in this glossary should appear highlighted throughout the Web site…..

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

You can find it here. First of all, one should never use “likelihood” and “probability” interchangeably in what is to be a clarification of formal terms, as these mean very different things in statistics. Some of the claims given actually aren’t so bad if “likely” takes its statistical meaning, but are all wet if construed as mathematical probability.

What really puzzles me is, how do they expect readers to understand the claims that appear within this definition? Are their meanings known to anyone? Watch:

Statistical Significance

A mathematical technique to measure whether the results of a study are likely to be true.

What does it mean to say “the results of a study are likely to be true”?

Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance.

Meaning?

Statistical significance is usually expressed as a P-value.
The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).

How should we define “more likely that the results are true”?

Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

oy, oy

The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

~~Oy, oy, oy~~ OK, I’ll turn this into a single “oy” and just suggest dropping “probably” (leaving the hypertext “probability”). But this was part of the illustration, not the definition.

Surely it’s possible to keep to their brevity and do a better job than this, even though one would really want to explain about the types of null hypotheses, the test statistic, the assumptions of the test (we aren’t told if their example is an RCT.) I’ve listed how they might capture what I think they mean to say, off the top of my head. Submit your improvements, corrections and additions, and I’ll add them. Updates will be indicated with (ii), (iii), etc.

Statistical Significance

A mathematical technique to measure whether the results of a study are likely to be true.
a) A statistical technique to measure whether the results of a study indicate the null hypothesis is false, that some genuine discrepancy from the null hypothesis exists.

Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance.
a) The statistical significance of an observed difference is the probability of observing results as large as was observed, even if the null hypothesis is true.
b) The statistical significance of an observed difference is how frequently even larger differences than were observed would occur (through chance variability), even if the null hypothesis is true.
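
To make (a) and (b) concrete, here is a minimal sketch of a P-value computed as a relative frequency under the null. It assumes a one-sided exact permutation test on the difference in means, which the glossary does not specify; the function name `permutation_p_value` and the outcome scores are hypothetical, invented only for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """One-sided exact permutation P-value for mean(group_a) - mean(group_b).

    Under the null hypothesis the group labels are arbitrary, so every way of
    relabelling the pooled data is equally probable. The P-value is the share
    of relabellings whose difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = mean(group_a) - mean(group_b)
    at_least_as_large = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance guards against floating-point ties.
        if mean(a) - mean(b) >= observed - 1e-12:
            at_least_as_large += 1
        total += 1
    return at_least_as_large / total


# Hypothetical outcome scores, invented for illustration only.
drug_a = [12.1, 11.4, 13.0, 12.7, 11.9]
drug_b = [11.8, 11.1, 12.2, 11.5, 12.0]
print(f"one-sided permutation P-value: {permutation_p_value(drug_a, drug_b):.3f}")
```

Nothing in this calculation is a probability that the null hypothesis is true, or that the results are “due to chance”; it is only how often differences this large would turn up were the null true.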

Statistical significance is usually expressed as a P-value.
a) Statistical significance may be expressed as a P-value associated with an observed difference from a null hypothesis H₀ within a given statistical test T.

The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).
a) The smaller the P-value, the less consistent the results are with the null hypothesis, and the more consistent they are with a genuine discrepancy from the null.
b) The smaller the P-value, the greater the probability a smaller value of the test statistic would have occurred, were the data from a world where the null hypothesis is adequate.

Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).
a) Researchers generally regard the results as inconsistent with the null hypothesis if statistical significance is less than 0.05 (p<.05).

(Part of the illustrative example): The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.
a) The probability that even larger differences would occur due to chance variability (even if the null is true) is high enough to regard the result as consistent with the null being true.

7/17/15 remark: Maybe there’s a convention in this glossary that if the word is not in hypertext, it is being used informally. In that case, this might not be so bad. I’d remove “probably” to get:

b) The probability that the results were due to chance was high enough to conclude that the two drugs did not differ in causing blood pressure problems.

7/17/15: In (ii), in reaction to a comment, I replaced d_obs with “observed difference” and cut out Pr(d ≥ d_obs; H₀). I also allowed that #6 wasn’t too bad, especially if (the non-hypertext) “probably” is removed. The only thing is, this was not part of the definition, but rather the illustration. So maybe this could be the basis for fixing the others in the definition itself.


69 thoughts on ““Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)”

July 17, 2015

Geo

Sorry, but you’ll lose the general public at the first appearance of “D-obs” or if not then, then at the first mention of “null hypothesis”. Yes, their explanation is flawed if viewed by a statistician and yes, it’s OK when it comes to general public explanations since you have to cater such messages to the lowest common denominator.

Reply

July 17, 2015

Peter Chapman

Statistical methods therefore have a unique status within science of being heavily used by people that have no understanding of what they are actually doing.

Reply

July 17, 2015

Geo

Agreed. Same goes for the world outside of science that tries to use statistics, for example Marketing, which is something I’m focusing on.

Reply

July 17, 2015

Mayo

Geo: This glossary was to be part of an interconnected set of definitions, and my putting d_obs as a shorthand for the observed difference is irrelevant; I’ll even take it out. It’s THEIR claims that I think NO-ONE can understand because they have no meaning, and some of them are utterly foreign to any terminology I’ve ever seen. Can you parse them to arrive at anything like a correct definition?

Reply

July 17, 2015

Geo

If you ask a layman to read it he’ll roughly get the idea of stat significance, probably with the usual misunderstandings (esp. if you probe deeper). However, there is a point beyond which a concept can’t be expressed in a simpler manner without losing its true/full meaning, and the level of this explanation is past that point. In fact, most other explanations on the site are of 1 or 2 sentences, so this one is actually unusually lengthy.

Also: probability and likelihood are completely interchangeable in everyday language (e.g. “like·li·hood (līk′lē-ho͝od′)
n.
1. The state of being probable; probability.
2. Something probable.” ). When they are not mingled as above they are listed as synonyms.

The 0.05 p value point serves as a somewhat useful reference-point/example for the complete novice.

Reply

July 17, 2015

Mayo

Geo: I don’t think a layman would get it. By the way, their other definitions, the ones I looked at, aren’t nearly as distorted and don’t try to dumb things down either.

Reply

July 17, 2015

Christian Hennig

In my view, the funniest thing, or, depending on my mood, the most worrying, is the wording “more likely that the results are true”.

I observe that sometimes students seem worried that something went wrong if things don’t come out significant. Although I kind of know the culture that breeds such ideas, this always felt a bit bizarre to me, but in this definition it’s just outrageous.

Reply

July 17, 2015

Mayo

Christian: Yes, what in the world could it mean for the results to be “true”, as opposed to false maybe?

Reply

July 17, 2015

David Oliver

What about their example demonstrating how to interpret a confidence interval?

“For example, a study shows that the risk of heart attack from a drug is 3 percent (0.03). The confidence interval is shown as “95% CI: 0.015, 0.04.” This means that if you conduct this study on 100 different samples of people, the risk of heart attack in 95 of the samples will fall between 1.5 percent and 4 percent. We are 95 percent confident that the true risk is between .015 and .04.”

This looks to my lay (but hopefully soon to be enlightened) eyes as being what my son would call a FAIL. Thoughts?
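
One way to see what the 95% refers to is a small simulation. The sketch below assumes a true risk of 0.03 (taken from the glossary’s example), a hypothetical sample size of 1000 per study, and a simple Wald interval; none of these design details are given in the glossary. It counts how often the interval computed from a fresh sample covers the true risk, which is the property the 95% describes, rather than how often sample risks land inside one particular interval.

```python
import math
import random


def wald_interval(events, n, z=1.96):
    """Approximate 95% Wald interval for a proportion (an assumed method)."""
    p_hat = events / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(true_risk, n, studies, seed=0):
    """Share of simulated studies whose interval contains the true risk."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(studies):
        # Each of n participants has the event with probability true_risk.
        events = sum(rng.random() < true_risk for _ in range(n))
        low, high = wald_interval(events, n)
        if low <= true_risk <= high:
            covered += 1
    return covered / studies


# Hypothetical design: 1000 people per study, 1000 repeated studies.
print(f"share of intervals covering the true risk: {coverage(0.03, 1000, 1000):.3f}")
```

The share should come out near 0.95, though the Wald interval is only approximate and can fall somewhat short of the nominal level when the risk is small.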

Reply

July 17, 2015

Mayo

David: Oy, I didn’t look at that, but will.

Reply

July 17, 2015

Michael Lew

Am I the only one to feel that the absence of the word “evidence” in this discussion is indicative of the most epic failure imaginable on the part of statistics and philosophy?

Both the general public and researchers are, or should be, interested in the evidential meaning of the data, and surely P-values have something to do with that evidence. Our inability to talk about evidence in an acceptable manner is what stops us from being able to explain P-values to each other.

Reply

July 17, 2015

Mayo

Michael: Well I don’t know if it’s a failure of philosophy (italics under “failure”, since philosophers don’t tend to address statistical evidence as it occurs in practice*), but I entirely agree that we should be talking about evidence. One thing: since we’re given this without background, and without nuance or context, the weakest construal seems warranted. An isolated statistically significant result, I often say, “indicates” such and such, and only when it passes an “audit” would I say there’s evidence. Evidence of a genuine effect, as Fisher insisted, requires more than an isolated low P-value. To reflect the stronger, evidential claim, I’d give the following “b” versions to #1 and #5:

#1 b) A statistical technique to measure whether the results of a study are evidence that there is a genuine effect, and thus that the null hypothesis is false.

#5 b) Researchers generally regard the results as evidence of inconsistency with the null if statistical significance is less than 0.05

*except to concede as Peter Achinstein does, that philosophical notions of evidence are irrelevant to scientists because they are a priori. But never mind all that.

Reply

July 17, 2015

Michael Lew

Your new 1b and 5b are improvements. They do seem to imply that evidence has the binary, all or none, property of existing or not existing. My own conception of evidence is more interesting than that: evidence can vary in degree, in convincingness, in specificity and in target, and I don’t think that my conception differs markedly from standard intuition.

Reply

July 17, 2015

Mayo

Michael: Well my notion of evidence for a claim C has degrees too, according to things like precision, accuracy and how severely tested C is. I was trying to improve on some “definitions” that are close to meaningless or misleading, without being told it’s too nuanced or technical for this glossary. Here, inconsistency is at a level. Perhaps your “convincingness” is akin to my “well-tested”. But I don’t want evidence to reflect rhetorical or psychological persuasive power, which is what “convincingness” sounds like, but epistemic-evidential warrant or corroborative force.
Please give me a full alternative statement under any of 1-6 and I’ll post it along with your name. A crowd sourced definition, you might say. Likely to be better than crowd-sourced replication attempts. I can hear the booing. Just kidding, …mostly.

Reply

July 17, 2015

Sander Greenland

Oh Canada!
Take a look at the bottom paragraph on p. 74 of
https://www.nji-inm.ca/index.cfm/publications/science-manual-for-canadian-judges/?langSwitch=en

Reply
July 17, 2015

Mayo

Sander: So who wrote this? Some of it isn’t so bad, except for the tone. However, it is inconsistent: all or nothing, and 95% as a high standard of proof. Please give the background scoop on this Canadian book for judges, or why it strikes you as akin to the Dept. of Health and Human Services glossary.

Reply

July 17, 2015

phaneron0

> which we _interpret_ as the null hypothesis is …

That is a notable point being made.

So ya, Sander: So who wrote this?

Keith

Reply

July 18, 2015

Sander Greenland

Your guess about the author is as good as mine – in fact better than mine for Keith.

So you have no problem with statements like alpha=0.05 corresponds to a “scientific attitude that unless we are 95% sure the null hypothesis is false, we provisionally accept it”?
Or that “a 95% threshold constitutes a rather high standard of proof.”?

Reply

July 18, 2015

phaneron0

I am a bit surprised there is a 3 day course and manual on science for Canadian judges.

When I get back from vacation, I’ll look into how it was put together (I am sure they will be open to criticism of their work).

It would be nice to have a least-wrong write-up of this for those not in an intro statistics course (i.e. self-contained and readable).

Keith

Reply
July 18, 2015

Mayo

Sander: No I do have a problem, just that it’s different from the ones I have with the glossary. I’m also put off by the tone. I will look at some of the rest of it. I mentioned it to Schachtman.

Reply

July 18, 2015

Nathan Schachtman

Mayo, Sander,

Thanks for the reference. I was unaware of the NJI volume. I see that Joe Cecil of the Federal Judicial Center, and Brian Baigrie of the Jackman Institute in Toronto had some peer review responsibilities for this document. I will have to ask them about the work.

I have not read it carefully, but there seem to be some howlers on basic definitions, and some rather idiosyncratic, Feyerabend-like views in the manual. The U.S. version, the Reference Manual on Scientific Evidence, which may have inspired the Canadian text, is put out jointly by the National Research Council and the Federal Judicial Center. The Reference Manual has a chapter dedicated solely to statistics by David Kaye and the late David Freedman (and others to regression analyses, epidemiology, clinical medicine, etc.). For all its shortcomings, the Reference Manual seems like a much more solid production.

Nathan

Reply

July 18, 2015

Mayo

Nathan: Where are the Feyerabend-like views? I’ve only looked at the part Sander called our attention to.

Reply

July 18, 2015

Nathan Schachtman

In skimming the index, I saw entries for the myth of objectivity, etc. Of course, this would be inconsistent with the stated mission of helping judges discern weak from strong scientific claims. I have not read any of the text except a few sentences around a search for 95%, which revealed some of the disturbing language that Sander referenced.

Reply

July 18, 2015

Mayo

Nathan: Actually I find nothing to disagree with in the “myth of objectivity” section, although they should not have called it that. However, I came upon a discussion of the difference between Bayesian and frequentist probabilities which commits a serious error concerning adjustment for selection (p. 131). It concerns DNA matching through databases. Cox and Mayo discuss it on p. 270 of our paper. http://www.phil.vt.edu/dmayo/personal_website/Ch%207%20mayo%20&%20cox.pdf

Reply

July 19, 2015

john byrd

The manual says the frequentist way to deal with fishing in a database is to add up the probability of a random match in each case to get a probability of at least one random match, as in Np. This is not a probability but an expected number of individuals. We want to use 1 − (1 − p)^N to get the desired probability of at least one random match.
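
The difference between the two quantities is easy to check numerically. A minimal sketch, assuming independent comparisons with the same random match probability p for each of N database profiles (the function names and the example values of p and N are hypothetical):

```python
import math


def expected_random_matches(p, n):
    """N * p: the expected number of random matches, which can exceed 1."""
    return n * p


def prob_at_least_one_match(p, n):
    """1 - (1 - p)^N for 0 <= p < 1, computed stably for very small p."""
    return -math.expm1(n * math.log1p(-p))


# Hypothetical values chosen only to show where the two quantities diverge.
for p, n in [(1e-6, 1_000), (1e-3, 5_000)]:
    print(p, n, expected_random_matches(p, n), prob_at_least_one_match(p, n))
```

When Np is small the two numbers nearly agree, which may be why Np gets used as a shortcut; once Np approaches or exceeds 1 it can no longer be read as a probability at all.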

I also think the whole depiction of scientific reasoning for frequentists and Bayesians is confusing. The presentation is circular in saying 1) we get a match for a crime scene sample while fishing in the database, then 2) make the unlucky person who matched the sample a suspect and take a DNA sample from him/her, then 3) compare a new DNA sample from our new “suspect” to the crime scene sample for confirmation. First, assuming we are using the same test (autosomal STRs, say), then of course there will be a second match unless the DNA labs made an error, which is rare. We might find at this stage that the proper estimate of the probability of at least one random match is very small, because we are comparing one person’s complete profile to a complete profile. However, when there are partial profiles from degraded samples the RMP might be much larger. The only way to go beyond the inference possible using the probability of at least one match in the fishing exercise is to do a different, independent DNA test, such as Y-STRs or mtDNA, which allows you to test anew the hypothesis that the person’s fluids were at the crime scene. If one or more independent tests match as well, the error probabilities combine to be very small. Any mismatch is an exclusion.

Reply

July 19, 2015

Mayo

John: I’ll read this more carefully. The point I was making is that it’s just mistaken to allege that frequentists pay a penalty in this type of searching for a known effect. Explaining a known effect, especially with a reliable method, has a completely different logic. It’s like the example I happened to be talking to Schachtman about yesterday: searching for an animal in which to show the teratogenic effects of thalidomide. They finally found the New Zealand rabbit. Or, finding my keys as in that cartoon I once posted.
https://errorstatistics.com/2012/07/21/always-the-last-place-you-look/

Reply

July 19, 2015

john byrd

I agree frequentists pay no penalty. The error probabilities must come from the right calculation as with any other application. I will also say that the statements about the Bayesian approach do not make sense to me (and I think I do understand the Bayesian reasoning used in practice).

Reply

July 20, 2015

Mayo

John: I have tried to give the logic and rationale to direct the correct calculation. I don’t know that anyone has tried to make these distinctions at all systematic.
Incidentally, I was looking at your commingled remains chapter last night. I thought you had some resampling exs, maybe elsewhere.

Reply

July 20, 2015

john byrd

You will see some bootstrap results for an omnibus statistic in that chapter.

Fishing in pools of potential suspects can be like what you have called cherry-picking. It is a great tool for discovery, but one must make sure the error probabilities are properly determined before flipping the result into confirmation. This is true for the frequentists and Bayesians (where the likelihoods include a random match prob in the denominator).

Reply

July 20, 2015

Mayo

John: What’s the random match probability in the denominator?

Reply

July 21, 2015

john byrd

The likelihood component usually compares the probability of the match, given the scene sample came from the suspect (approx 1.0) to the probability the scene sample is a random match. Random match probabilities can be very small when comparing DNA from an individual to a sample possibly from the same individual.
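
As a rough sketch of this ratio (the function name and the example random match probabilities are hypothetical, and a real forensic calculation involves much more than this):

```python
def likelihood_ratio(random_match_prob, p_match_if_source=1.0):
    """P(match | suspect is the source) / P(match | random, unrelated person).

    The numerator defaults to 1.0, reflecting the "approx 1.0" in the text.
    """
    if not 0 < random_match_prob <= 1:
        raise ValueError("random_match_prob must be in (0, 1]")
    return p_match_if_source / random_match_prob


# Hypothetical values: a complete profile versus a degraded, partial one.
print(likelihood_ratio(1e-9))  # very small RMP gives a very large ratio
print(likelihood_ratio(1e-2))  # larger RMP gives a much weaker ratio
```

The smaller the random match probability in the denominator, the larger the ratio, which is why partial profiles with a larger RMP carry so much less weight.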

Reply

July 18, 2015

Nathan Schachtman

Yeah; I agree that the “headline” seems misleading, and it is certainly out of keeping with the mission statement of the volume. I need to sit down and read it carefully, and sympathetically, which won’t happen for a while now.

Reply
July 21, 2015

Mayo

Andrew Gelman blogs my blog today with some comments of his own: http://andrewgelman.com/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/
Haven’t read it yet, he just sent me the link.

Reply
July 21, 2015

Mayo

I wish Gelman wouldn’t just say tests of hypotheses, or significance tests, are just bad, bad, bad and should never be used. That’s not so, even if in most (but not all) cases they should be supplemented with CIs or, even better, a severity analysis (to begin with). The howlers he gives generally allude to fringe sciences or bad statistics (or both). In the case of the former, there is no way that any improved statistics can improve their inference, but at most can identify where they’re illicit. Guess what tools we turn to in order to identify this illicitness? Yes, significance tests. One should always ask: what superior methods has he given? To just trash an entire methodology that is, for example, at the heart of randomized treatment control trials in medicine, and without which there are no tests of model assumptions, only encourages the kind of mindlessness and name-calling* that we should be fighting.

*As in: ban significance tests rather than use them correctly for what they’re intended to do, or blame the tools when it’s lack of self-criticism of the user that’s to blame.

Here’s a spoof on “statistical task forces”: https://errorstatistics.com/2015/01/31/2015-saturday-night-brainstorming-and-task-forces-1st-draft/

Reply

July 21, 2015

kogdato

my guess:
“IF someone is not sure, for any reason, that it is possible to use a significance test CORRECTLY,
then better not to use it, and to make a note: “significance unknown”.

Otherwise, if someone is making concrete claims about significance levels, CIs, etc., then he MUST PROVE the correctness of the test used, the results, etc.”

Reply

July 21, 2015

Mayo

Kogdato: Oh right, like we’re ever able to PROVE our inquiries will result in CORRECT claims or will not land in error for any reason. That’s absurd. In the case of statistics, the best we can do is probe for flaws, and in order to do that we will find ourselves turning to significance tests.

Reply

July 21, 2015

Mayo

This was the remark, quite puzzling given his own use of P-values:
“I don’t think people should be making decisions based on statistical significance in any case. In my ideal world, we’d be defining statistical significance just as a legacy project, so that students can understand outdated reports that might be of historical interest.”

If you burned all the error statistical books, they’d have to be reinvented, if we’re to learn about certain types of variable phenomena. There’s no essential difference between using N-P statistics, confidence intervals, significance tests, or general error statistical methods: they are appropriate for different uses and solving different problems. They may be unified in many different ways. My own favorite is by using them to formalize the notions of a severe test, corroboration, and self-correcting intended by Popper and Peirce, respectively.

Reply
July 22, 2015

Mark

Mayo, completely agree! In fact, in RCTs, I strongly believe (and can provide many, many references from the RCT literature) that the significance test is of primary importance. Models, estimated “effect sizes”, confidence intervals, etc., are certainly of interest, but they are secondary to the test.

Reply

July 22, 2015

Mayo

Mark: I would love to have a few references from you, whenever you get around to it, or if they’re in a source you can mention. Getting the refs from someone “on the inside” is best. I’d like to include a few, also, in my book. Thanks so much.

Reply
July 23, 2015

Anoneuoid

Mark wrote: “Mayo, completely agree! In fact, in RCTs, I strongly believe (and can provide many, many references from the RCT literature) that the significance test is of primary importance.”

I have difficulty thinking of a case where this could be justified. A significance test (as applied to an RCT) only tells you that the data in column A are different on average from the data in column B.

1) Without effect size it is not possible to sanity check the outcome.
-No experiment is perfect, and those with humans are very difficult to control. Anything that “goes wrong” will invalidate the assumptions of the significance test.

2) Even the most well designed and implemented RCT will never contain subjects that are EXACTLY the same at baseline or experience exactly nothing that may affect the outcome during the course of the study.
– It really is implausible that two samples came from the same hypothetical infinite population.
– Did the people running the RCT *really* randomize everything? E.g., the order in which pipettes, scalpels, etc. are used? Usually this is not practical and the answer is no.

3) Even if a difference is due to the treatment and it is substantial, that is not enough to say the drug should be used or the theory that motivated the trial is correct.
– Unknown to the researcher, an anti-cancer drug may work by killing gut cells leading to caloric restriction. It would be much cheaper and safer to have patients eat less than to give them the drug.
– An animal trial of a drug meant to improve intelligence may test the animal on some food acquisition task. If this drug makes the animals hungrier, they may be more motivated and so appear to be “smarter” than controls. Is this likely to translate to situations humans care about?
– The outcome measure may not be exactly what is important to treatment decisions (eg years of survival vs quality of life for a drug with bad side effects)

In most cases, “chance” is really shorthand for “too insubstantial to be worth further work at this time”. So why not just look directly at the observed difference, you needed to calculate it to get that p-value anyway?

Statistical significance (as used in RCTs) does not tell us anything in addition to what we would gather from looking at the distributions of results and discussing the multiple possible explanations for the differences. Of course, the arbitrary cut-off and all the misuses just make these problems worse.

Reply

July 23, 2015

Mayo

Anon: Get thee immediately to Stephen Senn’s post (which I’m about to reblog): https://errorstatistics.com/2012/07/09/stephen-senn-randomization-ratios-and-rationality-rescuing-the-randomized-clinical-trial-from-its-critics/

Reply

July 23, 2015

Mark

Mayo, sorry, I’ve been busy today teaching about, ironically, RCT design. I have some initial references in mind and will post them later this afternoon (for a starter, I’d suggest, again ironically, this paper of Senn’s that I linked to on that post that you linked to: http://www.ncbi.nlm.nih.gov/pubmed/7997705 — it’s not *exactly* about what I was saying above, but it describes very well the foundation for it — although I’m not sure that Senn would completely agree with me on the super-primacy of the hypothesis test over any estimate of effect size).

Anon: Your point 1 is a good, and quite real, one, but to me it doesn’t make sense to address those issues by simply making additional strong, untestable assumptions. For randomization inference, we need one (mostly untestable) assumption — non-differential loss. Note that this doesn’t simply mean similar numbers lost across groups, it’s much stronger than that.

Your second point is, unfortunately, completely irrelevant — see Senn’s post that Mayo linked to and the paper that I linked to. Your third point, while true, is also irrelevant. RCTs are not mechanistic studies — that is, randomization won’t tell us why a drug works, only that it seems to have had different effects, relative to the comparator, for at least some of those randomized.

Reply

July 23, 2015

Anoneuoid

Mark wrote: ” Your point 1 is a good, and quite real, one, but to me it doesn’t make sense to address those issues by simply making additional strong, untestable assumptions… RCTs are not mechanistic studies — that is, randomization won’t tell us why a drug works, only that it seems to have had different effects, relative to the comparator, for at least some of those randomized.”

So you agree that a plausible assumption is that 100% of RCTs are imperfect? In that case, how can you conclude the difference is due to the drug? The real answer, which is used informally by people who have seen the jungle of error it takes to run one is: because the effect size is large enough that it would require great levels of imperfection to cause it. The assumption is that this would have been noticed.

My point two is not really addressed by the Senn post. Say you have a population of 50% males and 50% females and take a sample. As sample size increases both the probability of very unbalanced and exactly balanced samples decreases. You will always have larger between group variance than within group variance if there are any subsets.
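
The claim about balance can be checked with exact binomial probabilities. The sketch below assumes each sampled person is male with probability 0.5, independently of the others, and it takes “very unbalanced” to mean fewer than 40% or more than 60% males; that threshold and the function names are assumptions made here for illustration.

```python
from math import comb


def prob_exactly_balanced(n):
    """P(exactly n/2 males) in a sample of size n; zero when n is odd."""
    if n % 2:
        return 0.0
    return comb(n, n // 2) / 2 ** n


def prob_very_unbalanced(n, low_pct=40, high_pct=60):
    """P(share of males is below low_pct% or above high_pct%)."""
    # Integer comparisons avoid floating-point trouble at the boundaries.
    tail_count = sum(
        comb(n, k)
        for k in range(n + 1)
        if 100 * k < low_pct * n or 100 * k > high_pct * n
    )
    return tail_count / 2 ** n


for n in (10, 100, 1000):
    print(n, prob_exactly_balanced(n), prob_very_unbalanced(n))
```

Under these assumptions both probabilities shrink as the sample grows: the sample share concentrates near 50% without ever being likely to hit it exactly.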

But really it is point 3 that matters. The outcome of a significance test after an RCT can range from extremely to not at all informative. We cannot even guess without additional information. That is why I believe it is a spurious data reduction step.

Reply
July 24, 2015

Mark

Hi Mayo,

Here are some references (most of which you undoubtedly already know), off the top of my head:

First, of course, there’s Fisher in DOE, which is of course all about testing (“every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis”, and so on).

Thomas Cook and David DeMets have a fabulous book from 2008 called Introduction to Statistical Methods in Clinical Trials, in the Preface for which they lay out their 3 underlying principles for randomized trials. Their second principle is that “RCTs are primarily hypothesis testing instruments. While inference beyond simple tests of the [study] hypotheses is clearly essential for a complete understanding of the results, we note that virtually all design features of an RCT are formulated with hypothesis testing in mind. … Even in the simplest situations, however, estimation of a ‘treatment effect’ is inherently model-based, dependent on implicit model assumptions, and the most well conducted trials are subject to biases that require that point estimates and confidence intervals be viewed cautiously.” Personally, I completely agree with the quoted text.

Then there’s Oscar Kempthorne’s 1977 paper in Journal of Statistical Planning and Inference, pp. 1-25, called “Why Randomize?” Or his 1979 paper in Sankhya pp. 115-145, called “Sampling Inference, Experimental Inference, and Observational Inference.”

David Freedman has papers on regression models (http://www.stat.berkeley.edu/~census/neyregr.pdf), logistic regression models (http://www.stat.berkeley.edu/~census/neylogit.pdf), and proportional hazards regression models (chapter 11 of his posthumously published book Statistical Models and Causal Inference) in experimental studies.

There’s this paper, which is one of my favorites: http://www.ncbi.nlm.nih.gov/pubmed/?term=groundhog+day+cause+and+effect

There’s this one, which is also a good paper: http://www.ncbi.nlm.nih.gov/pubmed/8220408

Hope these are helpful!

Reply

July 21, 2015

knk

so… at least 50 years of explanations “what is statistical significance” in every single textbook on stats… and.. epic fail 😀

still 99% of people cannot understand what it is

maybe it would be better to cut…

Reply

July 21, 2015

Mayo

Knk: but, to my knowledge, textbooks don’t mutilate their defn of stat significance as this glossary does. Any probabilistic notions can be, and are, easily butchered, partly because of the variable uses of “probability” and “likelihood” in ordinary English, and partly because of certain confusions about the role of probability in statistical inference. As you can see, the correct definitions aren’t so far from what was written (once you get into their statspeak), and that’s why I started with their sentences.
I don’t know any notions from formal statistics that are not misinterpreted. I don’t buy this whine, “oh it’s all just too hard to learn.” I think most laypersons grasp the value of randomized control trials in medicine in showing how some “benefits” we’d been sold are due to biases, revealed by statistical significance tests, e.g., hormone replacement therapy. Once I realized they were focusing on that use of tests (which makes sense given the audience), it was easy to clean up the wording.

Reply

July 21, 2015

knk

As for me, textbook definitions are OK in most cases.
But the fact is: people just can’t understand what it is.
I am almost sure that the vast majority of people
understand “significance” in the way US HHS wrote:
“technique to measure whether the results of a study are likely to be true”

I just think words such as “significance” and “confidence intervals” are irrelevant and confusing for most people. It would be better to use specific terms for such complex concepts, not just ordinary words. Logically, the exact meaning of the term “statistical significance” does not have much in common with what “common sense” calls “significance”.

Reply

July 21, 2015

Mayo

What do you mean “the results of a study are likely to be true”?
Error probabilities ARE specific terms for complex concepts, and quite easy to understand correctly, if one wants to.
The word “significance” isn’t necessary for the tests to be understood.

Reply

July 22, 2015

knk

Agreed. That’s what I am trying to say. The word “significance” isn’t necessary. And it would be better to don’t use it at all in the publications. Words such as “significance” or “confidence limits” just confuse the public (and many researchers).

Personally I would prefer to use the words such as “probability of Type I/II error”.

Reply

July 22, 2015

Mayo

Knk: you mean it would be better not to use it. I’m not sure they wouldn’t be as or more confused with the probability of a type I error, but it would be OK so long as they reported the actual type I error, which is the P-value. The type 2 error, however, requires specifying an alternative, which in my judgment should be done, even if it’s merely directional. There are one or two exceptions. Then what to do with confidence intervals (CIs)? I’ve developed a notion of the severity of a test (in relation to a specified discrepancy) that works for all these error probabilities as well as CIs, and precludes dichotomous reasoning of so-called NHSTs.
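To illustrate why a type II error probability cannot be stated until an alternative is specified, here is a minimal sketch. It assumes a one-sided z-test of H0: mu = mu0 against mu > mu0 with known sigma; the function name, the default sample size and the example alternatives are hypothetical choices made for illustration, not anything taken from the discussion above.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cutoffs and tail areas


def type_ii_error(mu1, mu0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Probability of failing to reject H0: mu = mu0 when the true mean is mu1.

    Assumes a one-sided z-test (reject when z >= cutoff) with known sigma.
    """
    cutoff = STD_NORMAL.inv_cdf(1.0 - alpha)   # fixed by alpha alone
    shift = (mu1 - mu0) / (sigma / sqrt(n))    # depends on the chosen alternative
    return STD_NORMAL.cdf(cutoff - shift)


# The type I error probability is alpha whatever the alternative,
# but the type II error probability changes with each alternative considered.
for mu1 in (0.1, 0.3, 0.5):
    print(f"alternative mu1={mu1}: type II error = {type_ii_error(mu1):.3f}")
```

In this sketch the cutoff is determined by alpha alone, while the value returned moves with `mu1`, which is the sense in which some alternative, even a merely directional one, has to be on the table before a type II error probability means anything.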

Reply

July 22, 2015

knk

The idea is simple:
– if we use something very specific for a complex concept (e.g. special terms like “p-value” or “probability of type I error”, or, maybe even better, just abstract algebraic notation without any “common” words at all), then no confusion is possible, because no one can understand what the special term means without checking the strict, explicit definition.

– if we start to call our statistical concepts by common words like “significance” or “confidence bands”, then, logically, we are implicitly suggesting that our p-values or alpha levels really have some relation to “significance” or “confidence”. Such a suggestion is philosophical speculation by its nature, and it has consequences: using a broad, common word to carry a very specific (and not common) meaning IS confusing.

What we are really doing when we talk about “statistical significance”, for example:
– first, we make an implicit assumption (without any proof, by the way) that our alpha levels or p-values have some connection with such a thing as “significance” (which is not strictly defined, by the way)
– then we claim that our newborn “significance” has a special statistical meaning, and is not what “practical” or “common sense” “significance” means. And we try to explain to people that by “significance” we mean not “significance” but one very specific, special meaning of “significance”.

As for me, this is just a failed attempt to use a common word as a term for a specific scientific concept. No need. People cannot understand anyway and only get more confused, having some sort of illusion of understanding.

It’s like:
– “OMG what are all these strange symbols and numbers?”
– “Well… it has some relation to significance… ”
– “Wow, so simple, now I get it!”

Reply

July 22, 2015

phaneron0

Mayo:

I put something on Andrew’s blog that could have perhaps gone here – http://andrewgelman.com/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/#comment-228456

I fear knk might be correct in that most people won’t put in the work required to get a pragmatic (purposeful) grasp of the concept rather than just a technically correct definition of it.

You are likely correct that people can learn – but to me it would be necessary for them to at least grasp why and when the distribution of p_values is or is not Uniform(0,1) and even some statisticians seem to have not grasped that….

Keith O’Rourke
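The point about when p-values are or are not Uniform(0,1) can be checked with a small simulation. What follows is only a sketch under stated assumptions: a one-sided z-test with known sigma = 1, normally distributed data, and hypothetical function names and parameter values chosen purely for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_p_value(sample, mu0=0.0, sigma=1.0):
    """P-value of a one-sided z-test of H0: mu = mu0 against mu > mu0."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / n ** 0.5)
    return 1.0 - STD_NORMAL.cdf(z)


def simulate_p_values(true_mean, n=25, trials=10000, seed=0):
    """Draw `trials` samples of size n from N(true_mean, 1); return their p-values."""
    rng = random.Random(seed)
    return [
        one_sided_p_value([rng.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]


def fraction_below(p_values, cutoff):
    """Share of simulated p-values at or below the cutoff."""
    return sum(p <= cutoff for p in p_values) / len(p_values)


null_p = simulate_p_values(true_mean=0.0)  # H0 true: p-values roughly Uniform(0,1)
alt_p = simulate_p_values(true_mean=0.5)   # H0 false: p-values pile up near 0

for cutoff in (0.05, 0.25, 0.50):
    print(cutoff, fraction_below(null_p, cutoff), fraction_below(alt_p, cutoff))
```

When the null holds and the test statistic is continuous, the share of p-values at or below any cutoff should be close to the cutoff itself; under the alternative used here the shares should be much larger. Discrete statistics or a misspecified model would break the uniformity, which this sketch does not explore.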

Reply

July 22, 2015

Mayo

Phaneron0: Thanks for putting your remark here, I wish Gelman would have as well, so people can see what he’s saying. My feeling is that these supposedly very hard to grasp notions are intuitively obvious and readily clarified PROVIDED one doesn’t foist a “probabilist” view down people’s throats, or that one isn’t already brainwashed. Probabilism, as I’m using that term (unlike how Taleb used it in a tweet the other day) holds that the role of probability in inference is to assign a degree of probability, support, belief and the like to statistical hypotheses. As a Peircean, you’ll know that CP was strongly against probabilism and strongly in favor of the view of induction as severe testing. Most importantly, CP zeros in on the confusion between deduction and induction at the heart of probabilism. The same point is in Popper, who I now think obtained it from Peirce (having found a reference I’d never seen before, in which Popper describes Peirce as one of the greatest philosophical thinkers of all time; not the exact words, but very nearly).
A similar discussion may be found in Fisher, by the way.

Reply

July 22, 2015

phaneron0

> A similar discussion may be found in Fisher, by the way.
I do believe Fisher borrowed heavily from Peirce but I thought there was no written record of that (while there is a written record for Popper e.g. Brent’s biography.)

> you’ll know that CP
I do think it’s risky for people to think they know what Peirce meant but confusion between deduction and induction [as you put it] at the heart of probabilism seems fairly clear and why I would almost never take posterior probabilities literally (as they are obtained deductively).

Here (in the suggested animation) I am just trying to give a moving picture of what p_values are and how they might or might not serve various purposes….

Reply

July 22, 2015

Mayo

Phan (if I may): Why do you think Fisher borrowed from Peirce? What matters is not so much what Peirce meant (though I would agree that in this case it’s rather clear) but, rather, understanding enough of what he meant to dig for gold in his work.
Where’s the animation? Sorry if I missed.

Reply

July 27, 2015

phaneron0

Mayo: I have been off the web.

> understanding enough of what he [Peirce] meant to dig for gold in his work.
Probably the best strategy.

> the animation?
I was just thinking of what to do, G. Cumming has already done a fair bit along these lines http://andrewgelman.com/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/?replytocom=229520#respond

I don’t like the “bad hammer, good screw driver language” but they are worth looking at.

Keith O’Rourke

Reply

July 27, 2015

john byrd

They seem to have come off the rails over there… Some strange depictions of frequentists’ reasoning.

July 22, 2015

coreyyanofsky

“Intuitively obvious” is going too far, as any number of sufferers… sorry, students in undergraduate stats courses for scientists and engineers can attest. No small number of them would object to “readily clarified” too. Not defending probabilism (here), just saying, as someone who had no difficulty in undergrad grasping the reasoning behind p-values, that you’re underestimating how difficult this stuff is on first acquaintance.

Reply

July 22, 2015

Mayo

Corey: I sometimes call philosophers “philosufferers”, and when I taught stat at Wharton (as a Ph.D student), I recall some called the subject “sadistics”, which is pretty clever. I have definitely seen graduate students (in philosophy of statistics type seminars) have difficulty expressing statistical notions carefully, which is different from not grasping them. The ideas are familiar to anyone skeptical of inflated “just so stories”. By the way, I find analogous points of difficulty in students of symbolic logic. There are certain fallibilities we have to call special attention to, but we don’t propose changing definitions simply because the ordinary English uses of terms (valid, conditional, argument, truth, model, sound, complete) differ from the logical ones.

Reply
July 23, 2015

Richard D. Morey

Mayo: “I’m not sure they wouldn’t be as or more confused with the probability of a type I error, but it would be OK so long as they reported the actual type I error, which is the P-value. ”

What did you mean by this? How is the p value the “actual type I error”?

Reply

July 23, 2015

Mayo

Richard: By definition! See https://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/

Reply

July 23, 2015

richarddmorey

hmmm, I guess I understand the “report” and “actual” to mean something different than you (or I’m misunderstanding your meaning). We do not *know* the Type I error rate, so we can’t exactly report any actual rate. The “actual” Type I error rate could, in fact, be 0 (or undefined, depending on how you look at it) if the null hypothesis were false. Moreover, if we take the definitions from your link it seems like the reference class is future hypothetical experiments in which the null hypothesis is true; otherwise it doesn’t make sense. It doesn’t make sense as a report of anything “actual” for this experiment.

Reply

July 23, 2015

Mayo

Richard: No, you’re confused about the meaning of the error probabilities associated with frequentist procedures. (Possibly applying a construal Bayesians may use.) Error statistical procedures have error probabilities, yes? They are defined mathematically now, in reference to a sampling distribution (of a test statistic), and do not depend on any actual repetitions in the future. I think it was my use of the word “actual” that got you confused. It only referred to the “attained” significance level or P-value, rather than a pre-designated cut-off (see the Lehmann-Romano quote). As discussed in the post that I linked to, the error probabilities associated with tests are hypothetical. That doesn’t mean they can’t also refer to what would actually occur in repetitions satisfying given requirements.
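The contrast between an “attained” significance level and a pre-designated cut-off can be sketched numerically. This is an illustration only, assuming a one-sided z-test whose statistic is standard normal under the null; the function names and the observed values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed sampling distribution of the statistic under H0


def attained_significance(z_obs):
    """P-value: probability under H0 of a statistic at least as large as z_obs."""
    return 1.0 - STD_NORMAL.cdf(z_obs)


def cutoff_for(alpha):
    """Pre-designated critical value for a one-sided test at level alpha."""
    return STD_NORMAL.inv_cdf(1.0 - alpha)


def rejects_by_cutoff(z_obs, alpha):
    return z_obs >= cutoff_for(alpha)


def rejects_by_p_value(z_obs, alpha):
    return attained_significance(z_obs) <= alpha


# Both rules give the same verdict; the p-value additionally reports the
# smallest level at which the observed statistic would lead to rejection.
for z_obs in (0.5, 1.0, 1.7, 2.5):
    print(z_obs, round(attained_significance(z_obs), 4),
          rejects_by_cutoff(z_obs, 0.05), rejects_by_p_value(z_obs, 0.05))
```

Both quantities are computed from the sampling distribution alone, with no reference to any repetitions actually carried out, which is the sense of “defined mathematically” in the comment above.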

Reply

July 23, 2015

richarddmorey

It was, indeed, the use of the word “actual” that I found confusing.

Reply

July 23, 2015

Mayo

Someone twittered me, in relation to the query Richard raised, that paper by Hubbard and Bayarri:

Click to access tr14-03.pdf

We’ve discussed it, and every issue they raise, several times before. I wish people would dare to THINK before just repeating the same lines others have given. Unfortunately, these points involve little statistics but a lot of philosophy and historical interpretation. If you follow the links from the “p-values are not error probabilities” post, to the detailed discussions of what is behind those hackneyed remarks*, repeated verbatim ad nauseam and over and over again…, then you can think it through yourself.

*Fisher was livid at Neyman for not using his stat book (yeah there was a paper in 1935 on exper design they disagreed on too) when teaching in the same building, and increasingly pretended he’d never made those behavioristic remarks, upon which a few people have built a completely misleading picture of the relationship between N-P and Fisherian statistics, and all of the sheep just follow.
Of relevance:
https://errorstatistics.com/2014/02/15/fisher-and-neyman-after-anger-management-2/
https://errorstatistics.com/2015/04/18/neyman-distinguishing-tests-of-statistical-hypotheses-and-tests-of-significance-might-have-been-a-lapse-of-someones-pen/
https://errorstatistics.com/2012/08/16/e-s-pearsons-statistical-philosophy/

Reply
Pingback: Friday links: Price = d’Alembert, the first null model war, and more | Dynamic Ecology
Pingback: Canadian Judges’ Reference Manual on Scientific Evidence | Schachtman Law
Pingback: Likely, unlikely, certain and impossible – AiProBlog.Com

I welcome constructive comments that are of relevance to the post and the discussion, and discourage detours into irrelevant topics, however interesting, or unconstructive declarations that "you (or they) are just all wrong". If you want to correct or remove a comment, send me an e-mail. If readers have already replied to the comment, you may be asked to replace it to retain comprehension.

69 thoughts on ““Statistical Significance” According to the U.S. Dept. of Health and Human Services (ii)”
© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.