Someone linked this to me on Twitter. I thought it was a home blog at first. Surely the U.S. Dept of Health and Human Services can give a better definition than this.

U.S. Department of Health and Human Services

Effective Health Care Program

Glossary of Terms

We know that many of the concepts used on this site can be difficult to understand. For that reason, we have provided you with a glossary to help you make sense of the terms used in Comparative Effectiveness Research. Every word that is defined in this glossary should appear highlighted throughout the Web site…

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. *Statistical significance* is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be *statistically significant* because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

You can find it here. First of all, one should never use “likelihood” and “probability” interchangeably in what is meant to be a clarification of formal terms, as these mean very different things in statistics. Some of the claims given actually aren’t so bad if “likely” takes its statistical meaning, but are all wet if construed as mathematical probability.

What really puzzles me is, how do they expect readers to understand the claims that appear within this definition? Are their meanings known to anyone? Watch:

**Statistical Significance**

- A mathematical technique to measure whether the results of a study are likely to be true.

**What does it mean to say “the results of a study are likely to be true”?**

- *Statistical significance* is calculated as the probability that an effect observed in a research study is occurring because of chance.

**Meaning?**

- Statistical significance is usually expressed as a P-value.
- The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).

**How should we define “more likely that the results are true”?**

- Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

**oy, oy**

- The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

**Oy, oy, oy** OK, I’ll turn this into a single “oy” and just suggest dropping “probably” (leaving the hypertext “probability”). But this was part of the illustration, not the definition.

Surely it’s possible to keep to their brevity and do a better job than this, even though one would really want to explain the types of null hypotheses, the test statistic, and the assumptions of the test (we aren’t told if their example is an RCT). I’ve listed, off the top of my head, how they might capture what I think they mean to say. Submit your improvements, corrections and additions, and I’ll add them. Updates will be indicated with (ii), (iii), etc.

**Statistical Significance**

- A mathematical technique to measure whether the results of a study are likely to be true.

a) A statistical technique to measure whether the results of a study indicate the null hypothesis is false, that some *genuine* discrepancy from the null hypothesis exists.

- *Statistical significance* is calculated as the probability that an effect observed in a research study is occurring because of chance.

a) The statistical significance of an observed difference is the probability of observing results as large as those observed, even if the null hypothesis is true.

b) The statistical significance of an observed difference is how frequently differences even larger than the one observed would occur (through chance variability), even if the null hypothesis is true.

- Statistical significance is usually expressed as a P-value.

a) Statistical significance may be expressed as a P-value associated with an observed difference from a null hypothesis *H*_{0} within a given statistical test T.

- The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).

a) The smaller the P-value, the less consistent the results are with the null hypothesis, and the more consistent they are with a genuine discrepancy from the null.

b) The smaller the P-value, the greater the probability that a smaller value of the test statistic would have occurred, were the data from a world where the null hypothesis is adequate.

- Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

a) Researchers generally regard the results as inconsistent with the null hypothesis if statistical significance is less than 0.05 (p<.05).

- (Part of the illustrative example): The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

a) The probability that even larger differences would occur due to chance variability (even if the null is true) is high enough to regard the result as consistent with the null being true.
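To make alternatives like 2(a) and 2(b) concrete, here is a minimal permutation sketch in Python, in the spirit of their Drug A/Drug B illustration. All counts are invented for illustration; this is not the actual study’s data.

```python
import random

random.seed(1)

# Hypothetical counts: adverse blood-pressure events on Drug A vs Drug B
# (invented numbers, not the study in the glossary's example).
n_per_group = 200
events_a, events_b = 18, 26
obs_diff = (events_b - events_a) / n_per_group

# Under the null hypothesis the drugs do not differ, so pool the outcomes
# and re-split them at random.  The P-value is then the relative frequency
# of simulated differences at least as large as the observed one -- i.e.,
# definition 2(b): how often chance variability alone yields such a difference.
pooled = [1] * (events_a + events_b) + [0] * (2 * n_per_group - events_a - events_b)
n_sims = 10_000
extreme = 0
for _ in range(n_sims):
    random.shuffle(pooled)
    sim_diff = (sum(pooled[n_per_group:]) - sum(pooled[:n_per_group])) / n_per_group
    if abs(sim_diff) >= abs(obs_diff):
        extreme += 1
p_value = extreme / n_sims
print(p_value)  # a non-small value, consistent with "not statistically significant"
```

Note that the simulation defines the P-value as a property of the test procedure under the null, not as “the probability that the results are true”.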

**7/17/15 remark:** Maybe there’s a convention in this glossary that if the word is not in hypertext, it is being used informally. In that case, this might not be so bad. I’d remove “probably” to get:

b) The probability that the results were due to chance was high enough to conclude that the two drugs did not differ in causing blood pressure problems.

**7/17/15:** In (ii), in reaction to a comment, I replaced d_{obs} with “observed difference”, and cut out Pr(d ≥ d_{obs}; *H*_{0}). I also allowed that #6 wasn’t too bad, especially if (the non-hypertext) “probably” is removed. The only thing is, this was *not* part of the definition, but rather the illustration. So maybe this could be the basis for fixing the others in the definition itself.


Sorry, but you’ll lose the general public at the first appearance of “d-obs”, or if not then, at the first mention of “null hypothesis”. Yes, their explanation is flawed if viewed by a statistician, and yes, it’s OK when it comes to general-public explanations, since you have to cater such messages to the lowest common denominator.

Statistical methods therefore have a unique status within science of being heavily used by people that have no understanding of what they are actually doing.

Agreed. Same goes for the world outside of science that tries to use statistics, for example Marketing, which is something I’m focusing on.

Geo: This glossary was to be part of an interconnected set of definitions, and my putting dobs as a shorthand for the observed difference is irrelevant, I’ll even take it out. It’s THEIR claims that I think NO-ONE can understand because they have no meaning, and some of them are utterly foreign to any terminology I’ve ever seen. Can you parse them to arrive at anything like a correct definition?

If you ask a layman to read it he’ll roughly get the idea of stat significance, probably with the usual misunderstandings (esp. if you probe deeper). However, there is a point beyond which a concept can’t be expressed in a simpler manner without losing its true/full meaning, and the level of this explanation is past that point. In fact, most other explanations on the site are 1 or 2 sentences, so this one is actually unusually lengthy.

Also: probability and likelihood are completely interchangeable in everyday language (e.g. “like·li·hood (līk′lē-ho͝od′)

n.

1. The state of being probable; probability.

2. Something probable.”). When they are not mingled as above they are listed as synonyms.

The 0.05 p value point serves as a somewhat useful reference-point/example for the complete novice.

Geo: I don’t think a layman would get it. By the way, their other definitions, the ones I looked at, aren’t nearly as distorted and don’t try to dumb things down either.

In my view, the funniest thing, or, depending on my mood, the most worrying, is the wording “more likely that the results are true”.

I observe that sometimes students seem worried that something went wrong if things don’t come out significant. Although I kind of know the culture that breeds such ideas, this always felt a bit bizarre to me; in this definition it’s just outrageous.

Christian. Yes, what in the world could it mean for the results to be “true” as opposed to false, maybe?

What about their example demonstrating how to interpret a confidence interval?

“For example, a study shows that the risk of heart attack from a drug is 3 percent (0.03). The confidence interval is shown as “95% CI: 0.015, 0.04.” This means that if you conduct this study on 100 different samples of people, the risk of heart attack in 95 of the samples will fall between 1.5 percent and 4 percent. We are 95 percent confident that the true risk is between .015 and .04.”

This looks to my lay (but hopefully soon to be enlightened) eyes as being what my son would call a FAIL. Thoughts?

David: Oy, I didn’t look at that, but will.
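For what it’s worth, the coverage reading of the “95%” in their CI example can be checked with a quick simulation: the 95% belongs to the interval-generating procedure over repeated samples, not to 95 of 100 sample risks landing inside one fixed interval. A sketch in Python, where the true risk (3%), sample size, number of studies, and the Wald-interval choice are all hypothetical:

```python
import math
import random

random.seed(2)

# Hypothetical setup: true heart-attack risk is 3% (0.03); each simulated
# study samples n patients and builds a Wald 95% CI for the risk.  The
# "95%" refers to the procedure: across repeated studies, about 95% of
# the intervals cover the true risk.
true_risk, n, n_studies = 0.03, 2000, 1000
covered = 0
for _ in range(n_studies):
    events = sum(random.random() < true_risk for _ in range(n))
    p_hat = events / n
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - half_width <= true_risk <= p_hat + half_width:
        covered += 1
coverage = covered / n_studies
print(coverage)  # should be close to 0.95
```

The point of the simulation is that each interval either covers the true risk or it doesn’t; the 95% is the long-run frequency with which the procedure succeeds.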

Am I the only one to feel that the absence of the word “evidence” in this discussion is indicative of the most epic failure imaginable on the part of statistics and philosophy?

Both the general public and researchers are, or should be, interested in the evidential meaning of the data, and surely P-values have something to do with that evidence. Our inability to talk about evidence in an acceptable manner is what stops us from being able to explain P-values to each other.

Michael: Well I don’t know if it’s a failure of philosophy (italics under “failure”, since philosophers don’t tend to address statistical evidence as it occurs in practice*), but I entirely agree that we should be talking about evidence. One thing: since we’re given this without background, and without nuance or context, the weakest construal seems warranted. An isolated statistically significant result, I often say, “indicates” such and such, and only when it passes an “audit” would I say there’s evidence. Evidence of a genuine effect, as Fisher insisted, requires more than an isolated low P-value. To reflect the stronger, evidential claim, I’d give the following “b” versions to #1 and #5:

#1 b) A statistical technique to measure whether the results of a study are evidence that there is a genuine effect, and thus that the null hypothesis is false.

#5 b) Researchers generally regard the results as evidence of inconsistency with the null if statistical significance is less than 0.05

*except to concede as Peter Achinstein does, that philosophical notions of evidence are irrelevant to scientists because they are a priori. But never mind all that.

Your new 1b and 5b are improvements. They do seem to imply that evidence has the binary, all or none, property of existing or not existing. My own conception of evidence is more interesting than that: evidence can vary in degree, in convincingness, in specificity and in target, and I don’t think that my conception differs markedly from standard intuition.

Michael: Well my notion of evidence for a claim C has degrees too, according to things like precision, accuracy and how severely tested C is. I was trying to improve on some “definitions” that are close to meaningless or misleading, without being told it’s too nuanced or technical for this glossary. Here, inconsistency is at a level. Perhaps your “convincingness” is akin to my “well-tested”. But I don’t want evidence to reflect rhetorical or psychological persuasive power, which is what “convincingness” sounds like, but epistemic-evidential warrant or corroborative force.

Please give me a full alternative statement under any of 1-6 and I’ll post it along with your name. A crowd sourced definition, you might say. Likely to be better than crowd-sourced replication attempts. I can hear the booing. Just kidding, …mostly.

Oh Canada!

Take a look at the bottom paragraph on p. 74 of

https://www.nji-inm.ca/index.cfm/publications/science-manual-for-canadian-judges/?langSwitch=en

Sander: So who wrote this? Some of it isn’t so bad, except for the tone. However, it is inconsistent: all or nothing, and 95% as a high standard of proof. Please give the background scoop on this Canadian book for judges, or why it strikes you as akin to the Dept. of Health and Human Services glossary.

> which we _interpret_ as the null hypothesis is …

That is a notable point being made.

So ya, Sander: So who wrote this?

Keith

Your guess about the author is as good as mine – in fact better than mine, Keith.

So you have no problem with statements like alpha=0.05 corresponds to a “scientific attitude that unless we are 95% sure the null hypothesis is false, we provisionally accept it”?

Or that “a 95% threshold constitutes a rather high standard of proof.”?

I am a bit surprised there is a 3 day course and manual on science for Canadian judges.

When I get back from vacation, I’ll look into how it was put together (I am sure they will be open to criticism of their work).

It would be nice to have a least-wrong write-up of this for those not in an intro statistics course (i.e., self-contained and readable).

Keith

Sander: No I do have a problem, just that it’s different from the ones I have with the glossary. I’m also put off by the tone. I will look at some of the rest of it. I mentioned it to Schachtman.

Mayo, Sander,

Thanks for the reference. I was unaware of the NJI volume. I see that Joe Cecil of the Federal Judicial Center, and Brian Baigrie of the Jackman Institute in Toronto had some peer review responsibilities for this document. I will have to ask them about the work.

I have not read it carefully, but there seem to be some howlers on basic definitions, and some rather idiosyncratic, Feyerabend-like views in the manual. The U.S. version, the Reference Manual on Scientific Evidence, which may have inspired the Canadian text, is put out jointly by the National Research Council and the Federal Judicial Center. The Reference Manual has a chapter dedicated solely to statistics by David Kaye and the late David Freedman (and others to regression analyses, epidemiology, clinical medicine, etc.). For all its shortcomings, the Reference Manual seems like a much more solid production.

Nathan

Nathan: Where are the Feyerabend-like views? I’ve only looked at the part Sander called our attention to.


In skimming the index, I saw entries for the myth of objectivity, etc. Of course, this would be inconsistent with the stated mission of helping judges discern weak from strong scientific claims. I have not read any of the text except a few sentences around a search for 95%, which revealed some of the disturbing language that Sander referenced.

Nathan: Actually I find nothing to disagree with in the “myth of objectivity” section, although they should not have called it that. However, I came upon a discussion of the difference between Bayesian and frequentist probabilities which commits a serious error concerning adjustment for selection (p. 131). It concerns DNA matching through databases. Cox and Mayo discuss it on p. 270 of our paper. http://www.phil.vt.edu/dmayo/personal_website/Ch%207%20mayo%20&%20cox.pdf

The manual says the frequentist way to deal with fishing in a database is to add up the probability of a random match in each case to get a probability of at least one random match, as in Np. But Np is not a probability; it is an expected number of matching individuals, and it can exceed 1. We want to use (1 – (1 – p)^N) to get the desired probability of at least one random match.
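A quick numerical check of that point, with purely illustrative values for p and N:

```python
# Np is the expected number of random matches when searching a database of
# N profiles with per-comparison random match probability p; it is not a
# probability and can exceed 1.  The probability of at least one random
# match is 1 - (1 - p)**N.  (The p and N values below are hypothetical.)
p = 1e-6
for N in (10_000, 1_000_000, 5_000_000):
    expected_matches = N * p                  # can exceed 1
    prob_at_least_one = 1 - (1 - p) ** N      # always within [0, 1]
    print(N, expected_matches, round(prob_at_least_one, 4))
```

For the largest database, Np is 5 (clearly not a probability), while the correct probability of at least one random match stays below 1.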

I also think the whole depiction of scientific reasoning for frequentists and Bayesians is confusing. The presentation is circular in saying 1) we get a match for a crime scene sample while fishing in the database, then 2) make the unlucky person who matched the sample a suspect and take a DNA sample from him/her, then 3) compare a new DNA sample from our new “suspect” to the crime scene sample for confirmation. First, assuming we are using the same test (autosomal STRs, say), then of course there will be a second match unless the DNA labs made an error, which is rare. We might find at this stage that the proper estimate of at least one random match is very small, because we are comparing one person’s complete profile to a complete profile. However, when there are partial profiles from degraded samples the RMP might be much larger. The only way to go beyond the inference possible using the probability of at least one match in the fishing exercise is to do a different, independent DNA test, such as Y-STRs or mtDNA, which allows you to test anew the hypothesis that the person’s fluids were at the crime scene. If one or more independent tests match as well, the error probabilities combine to be very small. Any mismatch is an exclusion.

John: I’ll read this more carefully. The point I was making is that it’s just mistaken to allege that frequentists pay a penalty in this type of searching for a known effect. Explaining a known effect, especially with a reliable method, has a completely different logic. It’s like the example I happened to be talking to Schachtman about yesterday: searching for an animal in which to show the teratogenic effects of thalidomide. They finally found the New Zealand rabbit. Or, finding my keys as in that cartoon I once posted.

https://errorstatistics.com/2012/07/21/always-the-last-place-you-look/

I agree frequentists pay no penalty. The error probabilities must come from the right calculation as with any other application. I will also say that the statements about the Bayesian approach do not make sense to me (and I think I do understand the Bayesian reasoning used in practice).

John: I have tried to give the logic and rationale to direct the correct calculation. I don’t know that anyone has tried to make these distinctions at all systematic.

Incidentally, I was looking at your commingled remains chapter last night. I thought you had some resampling exs, maybe elsewhere.

You will see some bootstrap results for an omnibus statistic in that chapter.

Fishing in pools of potential suspects can be like what you have called cherry-picking. It is a great tool for discovery, but one must make sure the error probabilities are properly determined before flipping the result into confirmation. This is true for the frequentists and Bayesians (where the likelihoods include a random match prob in the denominator).

John: What’s the random match probability in the denominator?

The likelihood component usually compares the probability of the match, given the scene sample came from the suspect (approx 1.0) to the probability the scene sample is a random match. Random match probabilities can be very small when comparing DNA from an individual to a sample possibly from the same individual.

Yeah; I agree that the “headline” seems misleading, and it is certainly out of keeping with the mission statement of the volume. I need to sit down and read it carefully, and sympathetically, which won’t happen for a while now.

Andrew Gelman blogs my blog today with some comments of his own: http://andrewgelman.com/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/

Haven’t read it yet, he just sent me the link.

I wish Gelman wouldn’t just say tests of hypotheses, or significance tests, are just bad, bad, bad and should never be used. That’s not so, even if in most (but not all) cases they should be supplemented with CIs or, even better, a severity analysis (to begin with). The howlers he gives generally allude to fringe sciences or bad statistics (or both). In the case of the former, there is no way that any improved statistics can improve their inference, but at most can identify where they’re illicit. Guess what tools we turn to in order to identify this illicitness? Yes, significance tests. One should always ask: what superior methods has he given? To just trash an entire methodology that is, for example, at the heart of randomized treatment control trials in medicine, and without which there are no tests of model assumptions, only encourages the kind of mindlessness and name-calling* that we should be fighting.

*As in: ban significance tests rather than use them correctly for what they’re intended to do, or blame the tools when it’s the user’s lack of self-criticism that’s to blame.

Here’s a spoof on “statistical task forces”: https://errorstatistics.com/2015/01/31/2015-saturday-night-brainstorming-and-task-forces-1st-draft/

my guess:

“IF someone is not sure it is possible to use a significance test CORRECTLY for any reason,

then better not to use it and make a note: “significance unknown”.

Otherwise, if someone is making concrete claims about significance levels, CIs, etc., then he MUST PROVE the correctness of the test used/results etc.

Kogdato: Oh right, like we’re ever able to PROVE our inquiries will result in CORRECT claims or will not land in error for any reason. That’s absurd. In the case of statistics, the best we can do is probe for flaws, and in order to do that we will find ourselves turning to significance tests.

This was the remark, quite puzzling given his own use of P-values:

“I don’t think people should be making decisions based on statistical significance in any case. In my ideal world, we’d be defining statistical significance just as a legacy project, so that students can understand outdated reports that might be of historical interest.”

If you burned all the error statistical books, they’d have to be reinvented, if we’re to learn about certain types of variable phenomena. There’s no essential difference between using N-P statistics, CI intervals, significance tests, or general error statistical methods–they are appropriate for different uses and solving different problems. They may be unified in many different ways. My own favorite is by using them to formalize the notions of a severe test, corroboration, and self-correcting intended by Popper and Peirce, respectively.

Mayo, completely agree! In fact, in RCTs, I strongly believe (and can provide many, many references from the RCT literature) that the significance test is of primary importance. Models, estimated “effect sizes”, confidence intervals, etc., are certainly of interest, but they are secondary to the test.

Mark: I would love to have a few references from you, whenever you get around to it, or if they’re in a source you can mention. Getting the refs from someone “on the inside” is best. I’d like to include a few, also, in my book. Thanks so much.

Mark wrote: “Mayo, completely agree! In fact, in RCTs, I strongly believe (and can provide many, many references from the RCT literature) that the significance test is of primary importance.”

I have difficulty thinking of a case where this could be justified. A significance test (as applied to an RCT) only tells you the data in column A is different on average from the data in column B.

1) Without effect size it is not possible to sanity check the outcome.

– No experiment is perfect, and those with humans are very difficult to control. Anything that “goes wrong” will invalidate the assumptions of the significance test.

2) Even the most well designed and implemented RCT will never contain subjects that are EXACTLY the same at baseline or experience exactly nothing that may affect the outcome during the course of the study.

– It really is implausible that two samples came from the same hypothetical infinite population.

– Did the people running the RCT *really* randomize everything? E.g. the order pipettes, scalpels, etc. are used? Usually this is not practical and the answer is no.

3) Even if a difference is due to the treatment and it is substantial, that is not enough to say the drug should be used or the theory that motivated the trial is correct.

– Unknown to the researcher, an anti-cancer drug may work by killing gut cells leading to caloric restriction. It would be much cheaper and safer to have patients eat less than to give them the drug.

– An animal trial of a drug meant to improve intelligence may test the animal on some food-acquisition task. If this drug makes the animal hungrier, they may be more motivated and so appear to be “smarter” than controls. Is this likely to translate to situations humans care about?

– The outcome measure may not be exactly what is important to treatment decisions (e.g. years of survival vs quality of life for a drug with bad side effects).

In most cases, “chance” is really shorthand for “too insubstantial to be worth further work at this time”. So why not just look directly at the observed difference, you needed to calculate it to get that p-value anyway?

Statistical significance (as used in RCTs) does not tell us anything in addition to what we would gather from looking at the distributions of results and discussing the multiple possible explanations for the differences. Of course, the arbitrary cut-off and all the misuses just make these problems worse.

Anon: Get thee immediately to Stephen Senn’s post (which I’m about to reblog): https://errorstatistics.com/2012/07/09/stephen-senn-randomization-ratios-and-rationality-rescuing-the-randomized-clinical-trial-from-its-critics/

Mayo, sorry, I’ve been busy today teaching about, ironically, RCT design. I have some initial references in mind and will post them later this afternoon (for a starter, I’d suggest, again ironically, this paper of Senn’s that I linked to on that post that you linked to: http://www.ncbi.nlm.nih.gov/pubmed/7997705 — it’s not *exactly* about what I was saying above, but it describes very well the foundation for it — although I’m not sure that Senn would completely agree with me on the super-primacy of the hypothesis test over any estimate of effect size).

Anon: Your point 1 is a good, and quite real, one, but to me it doesn’t make sense to address those issues by simply making additional strong, untestable assumptions. For randomization inference, we need one (mostly untestable) assumption — non-differential loss. Note that this doesn’t simply mean similar numbers lost across groups, it’s much stronger than that.

Your second point is, unfortunately, completely irrelevant — see Senn’s post that Mayo linked to and the paper that I linked to. Your third point, while true, is also irrelevant. RCTs are not mechanistic studies — that is, randomization won’t tell us why a drug works, only that it seems to have had different effects, relative to the comparator, for at least some of those randomized.

Mark wrote: ” Your point 1 is a good, and quite real, one, but to me it doesn’t make sense to address those issues by simply making additional strong, untestable assumptions… RCTs are not mechanistic studies — that is, randomization won’t tell us why a drug works, only that it seems to have had different effects, relative to the comparator, for at least some of those randomized.”

So you agree that a plausible assumption is that 100% of RCTs are imperfect? In that case, how can you conclude the difference is due to the drug? The real answer, which is used informally by people who have seen the jungle of error it takes to run one is: because the effect size is large enough that it would require great levels of imperfection to cause it. The assumption is that this would have been noticed.

My point two is not really addressed by the Senn post. Say you have a population of 50% males and 50% females and take a sample. As sample size increases both the probability of very unbalanced and exactly balanced samples decreases. You will always have larger between group variance than within group variance if there are any subsets.

But really it is point 3 that matters. The outcome of a significance test after an RCT can range from extremely to not at all informative. We cannot even guess without additional information. That is why I believe it is a spurious data reduction step.

Hi Mayo,

Here are some references (most of which you undoubtedly already know), off the top of my head:

First, of course, there’s Fisher in DOE, which is of course all about testing (“every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis”, and so on).

Thomas Cook and David DeMets have a fabulous book from 2008 called Introduction to Statistical Methods in Clinical Trials, in the Preface for which they lay out their 3 underlying principles for randomized trials. Their second principle is that “RCTs are primarily hypothesis testing instruments. While inference beyond simple tests of the [study] hypotheses is clearly essential for a complete understanding of the results, we note that virtually all design features of an RCT are formulated with hypothesis testing in mind. … Even in the simplest situations, however, estimation of a ‘treatment effect’ is inherently model-based, dependent on implicit model assumptions, and the most well conducted trials are subject to biases that require that point estimates and confidence intervals be viewed cautiously.” Personally, I completely agree with the quoted text.

Then there’s Oscar Kempthorne’s 1977 paper in Journal of Statistical Planning and Inference, pp. 1-25, called “Why Randomize?” Or his 1979 paper in Sankhya pp. 115-145, called “Sampling Inference, Experimental Inference, and Observational Inference.”

David Freedman has papers on regression models (http://www.stat.berkeley.edu/~census/neyregr.pdf), logistic regression models (http://www.stat.berkeley.edu/~census/neylogit.pdf), and proportional hazards regression models (chapter 11 of his posthumously published book Statistical Models and Causal Inference) in experimental studies.

There’s this paper, which is one of my favorites: http://www.ncbi.nlm.nih.gov/pubmed/?term=groundhog+day+cause+and+effect

There’s this one, which is also a good paper: http://www.ncbi.nlm.nih.gov/pubmed/8220408

Hope these are helpful!

so… at least 50 years of explanations of “what is statistical significance” in every single textbook on stats… and… epic fail 😀

still 99% of people cannot understand what it is

maybe it would be better to cut…

Knk: but, to my knowledge, textbooks don’t mutilate their defn of stat significance as this glossary does. Any probabilistic notions can be and are easily butchered, partly because of the variable uses of “probability” and “likelihood” in ordinary English, and partly because of certain confusions about the role of probability in statistical inference. As you can see, the correct definitions aren’t so far from what was written (once you get into their statspeak), and that’s why I started with their sentences.

I don’t know any notions from formal statistics that are not misinterpreted. I don’t buy this whine, “oh it’s all just too hard to learn.” I think most laypersons grasp the value of randomized control trials in medicine in showing how some “benefits” we’d been sold are due to biases, revealed by statistical significance tests, e.g., hormone replacement therapy. Once I realized they were focusing on that use of tests (which makes sense given the audience), it was easy to clean up the wording.

As for me, textbook definitions are OK in most cases

But the fact is: people just can’t understand what it is.

I am almost sure that the absolute majority of people understand “significance” in the way the US HHS wrote it:

“technique to measure whether the results of a study are likely to be true”

I just think words such as “significance” and “confidence intervals” are irrelevant and confusing for most people. It would be better to use specific terms for such complex concepts, not ordinary words. Logically, the exact meaning of the term “statistical significance” does not have much in common with what “common sense” calls “significance”.

What do you mean “the results of a study are likely to be true”?

Error probabilities ARE specific terms for complex concepts, and quite easy to understand correctly, if one wants to.

The word “significance” isn’t necessary for the tests to be understood.

Agreed. That’s what I am trying to say. The word “significance” isn’t necessary, and it would be better not to use it at all in publications. Words such as “significance” or “confidence limits” just confuse the public (and many researchers).

Personally I would prefer to use the words such as “probability of Type I/II error”.

Knk: I’m not sure they wouldn’t be as or more confused with the probability of a type I error, but it would be OK so long as they reported the actual type I error, which is the P-value. The type II error, however, requires specifying an alternative, which in my judgment should be done, even if it’s merely directional. There are one or two exceptions. Then what to do with confidence intervals (CIs)? I’ve developed a notion of the severity of a test (in relation to a specified discrepancy) that works for all these error probabilities as well as CIs, and precludes the dichotomous reasoning of so-called NHSTs.
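[Editor’s illustration] The point that a type II error probability requires specifying an alternative can be made concrete with a small sketch. This is my own example, not from the discussion; it assumes a one-sided z-test of H0: μ = 0 with known σ = 1:

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)

def type2_error(mu1, n, sigma=1.0, alpha=0.05):
    """P(fail to reject H0: mu = 0 | true mean = mu1) for the one-sided
    z-test that rejects when z exceeds the upper-alpha cutoff."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)   # rejection cutoff on the z scale
    shift = mu1 * math.sqrt(n) / sigma        # centre of the z statistic under mu1
    return STD_NORMAL.cdf(z_alpha - shift)    # chance z still falls below the cutoff

# There is no single "type II error" of a test: each alternative gives its own.
for mu1 in (0.1, 0.3, 0.5):
    print(mu1, round(type2_error(mu1, n=30), 3))
```

The type I error probability (α) is fixed by the test alone, but the type II error probability changes with every alternative μ1, which is why the alternative, even a merely directional one, has to be stated.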

The idea is simple:

– if we use something very specific for a complex concept (e.g. special terms like “p-value” or “probability of type I error”, or, maybe even better, just abstract algebraic notation without any “common” words at all), then no confusion is possible, because no one can understand what the special term means without checking its strict, explicit definition.

– if we start to call our statistical concepts by common words like “significance” or “confidence bands”, then, logically, we are making the implicit suggestion that our p-values or alpha levels really have some relation to “significance” or “confidence”. Such a suggestion is a philosophical speculation by its nature, and it leads to consequences: using a broad concept and a common word to imply a very specific (and not common) meaning IS confusing.

What we are really doing when we talk about “statistical significance”, for example:

– first, we make the implicit assumption (without any proof, by the way) that our alpha levels or p-values have some connection with such a thing as “significance” (which is not strictly defined, by the way)

– then we claim that our newborn “significance” has a special statistical meaning, and that it is not what “practical” or “common sense” “significance” means. And we try to explain to people that by “significance” we mean not “significance” but one very specific special meaning of “significance”.

As for me, this is just a failed attempt to use a common word as a term for a specific scientific concept. No need. People cannot understand it anyway and only get more confused, having some sort of illusion of understanding.

It’s like:

– “OMG what are all these strange symbols and numbers?”

– “Well… it has some relation to significance… ”

– “Wow, so simple, now i get it!”

Mayo:

I put something on Andrew’s blog that could have perhaps gone here – http://andrewgelman.com/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/#comment-228456

I fear knk might be correct in that most people won’t put in the work required to get a pragmatic (purposeful) grasp of the concept rather than just a technically correct definition of it.

You are likely correct that people can learn – but to me it would be necessary for them to at least grasp why and when the distribution of p-values is or is not Uniform(0,1), and even some statisticians seem not to have grasped that…

Keith O’Rourke
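[Editor’s illustration] A rough version of the moving picture Keith mentions, i.e., when the distribution of p-values is or is not Uniform(0,1), can be had from a short simulation. This is my own sketch, assuming a two-sided z-test with known σ = 1:

```python
import math
import random

def p_two_sided_z(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = mu0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # = P(|Z| >= |z|) under H0

random.seed(1)
n, reps = 30, 2000

# When H0 is true (mean really is 0), p-values are Uniform(0,1):
p_null = [p_two_sided_z([random.gauss(0.0, 1) for _ in range(n)]) for _ in range(reps)]
# When H0 is false (mean is 0.5), p-values pile up near 0:
p_alt = [p_two_sided_z([random.gauss(0.5, 1) for _ in range(n)]) for _ in range(reps)]

frac_null = sum(p < 0.05 for p in p_null) / reps  # close to 0.05, as uniformity implies
frac_alt = sum(p < 0.05 for p in p_alt) / reps    # much larger (the test's power)
print(frac_null, frac_alt)
```

The uniformity under H0 is exactly what licenses reading a small p-value as an improbably extreme result under the null; under an alternative, the distribution is no longer uniform, and that departure is what gives the test its power.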

Phaneron0: Thanks for putting your remark here; I wish Gelman would have as well, so people can see what he’s saying. My feeling is that these supposedly very hard to grasp notions are intuitively obvious and readily clarified PROVIDED one doesn’t force a “probabilist” view down people’s throats, or that one isn’t already brainwashed. Probabilism, as I’m using that term (unlike how Taleb used it in a tweet the other day), holds that the role of probability in inference is to assign a degree of probability, support, belief and the like to statistical hypotheses. As a Peircean, you’ll know that CP was strongly against probabilism and strongly in favor of the view of induction as severe testing. Most importantly, CP zeros in on the confusion between deduction and induction at the heart of probabilism. The same point is in Popper, who I now think obtained it from Peirce (having found a reference I’d never seen before, in which Popper describes Peirce as one of the greatest philosophical thinkers of all time; not the exact words, but very nearly).

A similar discussion may be found in Fisher, by the way.

> A similar discussion may be found in Fisher, by the way.

I do believe Fisher borrowed heavily from Peirce, but I thought there was no written record of that (while there is a written record for Popper, e.g., Brent’s biography).

> you’ll know that CP

I do think it’s risky for people to think they know what Peirce meant, but confusion between deduction and induction [as you put it] at the heart of probabilism seems fairly clear, and is why I would almost never take posterior probabilities literally (as they are obtained deductively).

Here (in the suggested animation) I am just trying to give a moving picture of what p-values are and how they might or might not serve various purposes…

Phan (if I may): Why do you think Fisher borrowed from Peirce? What matters is not so much what Peirce meant (though I would agree that in this case it’s rather clear) but, rather, understanding enough of what he meant to dig for gold in his work.

Where’s the animation? Sorry if I missed.

Mayo: I have been off the web.

> understanding enough of what he [Peirce] meant to dig for gold in his work.

Probably the best strategy.

> the animation?

I was just thinking of what to do, G. Cumming has already done a fair bit along these lines http://andrewgelman.com/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/?replytocom=229520#respond

I don’t like the “bad hammer, good screwdriver” language, but they are worth looking at.

Keith O’Rourke

They seem to have come off the rails over there… Some strange depictions of frequentists’ reasoning.

“Intuitively obvious” is going too far, as any number of sufferers… sorry, students in undergraduate stats courses for scientists and engineers can attest. No small number of them would object to “readily clarified” too. Not defending probabilism (here), just saying, as someone who had no difficulty in undergrad grasping the reasoning behind p-values, that you’re underestimating how difficult this stuff is on first acquaintance.

Corey: I sometimes call philosophers “philosufferers”, and when I taught stat at Wharton (as a Ph.D. student), I recall some called the subject “sadistics”, which is pretty clever. I have definitely seen graduate students (in philosophy of statistics type seminars) have difficulty expressing statistical notions carefully, which is different from not grasping them. The ideas are familiar to anyone skeptical of inflated “just so stories”. By the way, I find analogous points of difficulty in students of symbolic logic. There are certain fallibilities we have to call special attention to, but we don’t propose changing definitions simply because the ordinary English uses of terms (valid, conditional, argument, truth, model, sound, complete) differ from the logical ones.

Mayo: “I’m not sure they wouldn’t be as or more confused with the probability of a type I error, but it would be OK so long as they reported the actual type I error, which is the P-value. ”

What did you mean by this? How is the p value the “actual type I error”?

Richard: By definition! See https://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/

Hmmm, I guess I understand “report” and “actual” to mean something different than you do (or I’m misunderstanding your meaning). We do not *know* the Type I error rate, so we can’t exactly report any actual rate. The “actual” Type I error rate could, in fact, be 0 (or undefined, depending on how you look at it) if the null hypothesis were false. Moreover, if we take the definitions from your link, it seems the reference class is future hypothetical experiments in which the null hypothesis is true; otherwise it doesn’t make sense. It doesn’t make sense as a report of anything “actual” for this experiment.

Richard: No, you’re confused about the meaning of the error probabilities associated with frequentist procedures. (Possibly applying a construal Bayesians may use.) Error statistical procedures have error probabilities, yes? They are defined mathematically, in reference to a sampling distribution (of a test statistic), and do not depend on any actual repetitions in the future. I think it was my use of the word “actual” that got you confused. It only referred to the “attained” significance level or P-value, rather than a pre-designated cut-off (see the Lehmann-Romano quote). As discussed in the post I linked to, the error probabilities associated with tests are hypothetical. That doesn’t mean they can’t also refer to what would actually occur in repetitions satisfying given requirements.
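[Editor’s illustration] One way to see this hypothetical character: the attained significance level is a tail area of the null sampling distribution, and simulating that distribution (no “actual” future repetitions required) reproduces the analytic p-value. A sketch under my own assumptions (z-test of H0: mean = 0 with known σ = 1):

```python
import math
import random

random.seed(7)
n = 25
observed = [random.gauss(0.3, 1) for _ in range(n)]  # the one sample actually in hand
z_obs = abs(sum(observed)) / math.sqrt(n)            # |z| statistic for H0: mean = 0

# Analytic p-value (the "attained" significance level):
p_formula = math.erfc(z_obs / math.sqrt(2))          # = P(|Z| >= z_obs) under H0

# The same number read off the hypothetical sampling distribution under H0:
reps = 50_000
z_null = [abs(sum(random.gauss(0.0, 1) for _ in range(n))) / math.sqrt(n)
          for _ in range(reps)]
p_sampling = sum(z >= z_obs for z in z_null) / reps

print(round(p_formula, 3), round(p_sampling, 3))  # agree up to simulation noise
```

The simulated repetitions here are purely a computational device for evaluating the sampling distribution; nothing about the number depends on the experiment ever being rerun.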

It was, indeed, the use of the word “actual” that I found confusing.

Someone tweeted me, in relation to the query Richard raised, the paper by Hubbard and Bayarri:

We’ve discussed it, and every issue they raise, several times before. I wish people would dare to THINK before just repeating the same lines others have given. Unfortunately, these points involve little statistics but a lot of philosophy and historical interpretation. If you follow the links from the “p-values are not error probabilities” post to the detailed discussions of what is behind those hackneyed remarks*, repeated verbatim ad nauseam, then you can think it through yourself.

*Fisher was livid at Neyman for not using his stats book when teaching in the same building (yes, there was also a 1935 paper on experimental design they disagreed about), and increasingly pretended he’d never made those behavioristic remarks, upon which a few people have built a completely misleading picture of the relationship between N-P and Fisherian statistics, and all the sheep just follow.

Of relevance:

https://errorstatistics.com/2014/02/15/fisher-and-neyman-after-anger-management-2/

https://errorstatistics.com/2015/04/18/neyman-distinguishing-tests-of-statistical-hypotheses-and-tests-of-significance-might-have-been-a-lapse-of-someones-pen/

https://errorstatistics.com/2012/08/16/e-s-pearsons-statistical-philosophy/

Pingback: Friday links: Price = d’Alembert, the first null model war, and more | Dynamic Ecology

Pingback: Canadian Judges’ Reference Manual on Scientific Evidence | Schachtman Law

Pingback: Likely, unlikely, certain and impossible – AiProBlog.Com