When it comes to the statistics wars, leaders of rival tribes sometimes sound as if they believed “les stats, c’est moi”.[1] So, rather than say they would like to supplement some well-known tenets (e.g., “a statistically significant effect may not be substantively important”) with a new rule that advances their particular preferred language or statistical philosophy, they may simply blurt out: “**we take that step here!**” followed by whatever rule of language or statistical philosophy they happen to prefer (as if they have just added the new rule to the existing, uncontested tenets). Karen Kafadar, in her last official (December) report as President of the American Statistical Association (ASA), expresses her determination to call out this problem at the ASA itself. (She raised it first in her June article, discussed in my last post.)

One final challenge, which I hope to address in my final month as ASA president, concerns issues of significance, multiplicity, and reproducibility. In 2016, the ASA published a statement that simply reiterated what p-values are and are not. It did not recommend specific approaches, other than “good statistical practice … principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean.” The guest editors of the March 2019 supplement to *The American Statistician* went further, writing: “The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned. We take that step here. … [I]t is time to stop using the term ‘statistically significant’ entirely.” Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was co-authored by an official of the ASA. In fact, the ASA does not endorse any article, by any author, in any journal—even an article written by a member of its own staff in a journal the ASA publishes. (Kafadar, December President’s Corner)

Yet Wasserstein et al. 2019 describes itself as a *continuation* of the ASA 2016 Statement on P-values, which I abbreviate as ASA I. (Wasserstein is the Executive Director of the ASA.) It describes itself as merely recording the decision to “take that step here”, adding one more “don’t” to ASA I. As part of this new “don’t,” it also stipulates that we should not consider “at all” whether pre-designated P-value thresholds are met. (It also restates four of the six principles of ASA I in considerably stronger form. I argue, in fact, that the resulting principles are inconsistent with principles 1 and 4 of ASA I; see my post from June 17, 2019.) Since it describes itself as a continuation of the ASA policy in ASA I, and that description survived peer review at the journal TAS, readers presume that’s what it is; absent any disclaimer to the contrary, that conception (or misconception) remains operative.

There really is no other way to read the claim in the Wasserstein et al. March 2019 editorial: “*The ASA Statement on P-Values and Statistical Significance* stopped just short of recommending that declarations of ‘statistical significance’ be abandoned.[2] We take that step here.” Had the authors viewed their follow-up as anything but a continuation of ASA I, they would have said something like: “Our own recommendation is to go *much further* than ASA I. We suggest that all branches of science stop using the term ‘statistically significant’ entirely.” They do not say that. What they say is written from the perspective of “Les stats, c’est moi”.

**The 2019 P-value Project II**

Kafadar deserves a great deal of credit for providing some needed qualification in her December note. However, there needs to be a disclaimer by the ASA as regards what it calls its **P-value Project**. The P-value Project, begun in 2014, refers to the overall ASA campaign to provide guides for the correct use and interpretation of P-values and statistical significance; journal editors and societies are to consider revising their instructions to authors to take its guidelines into account. ASA I was distilled from many meetings and discussions among representatives of statistics. The only difference in today’s P-value Project is that both ASA I *and* the 2019 editorial by Wasserstein et al. are to form the new ASA guidelines–even if the latter is not to be regarded as a continuation of ASA I (in accord with Kafadar’s qualification). I will refer to it as the **2019 ASA P-value Project II**. Wasserstein et al. 2019 is a piece of the P-value Project, and the authors thank the ASA for its support of this Project at the end of the article. [4] [5]

**Of Policies and Working Groups**

Kafadar continues:

Even our own ASA members are asking each other, “What do we tell our collaborators when they ask us what they should do about statistical hypothesis tests and p-values?” Should the ASA have a policy on hypothesis testing or on using “statistical significance”?

Allow me to weigh in here: No, no it should not. At one time I would have said yes, but no more. I can hear the policy now (sounding much like Wasserstein et al. 2019, only written in stone): “Don’t say, never say, or if you really feel you must say significance, and are prepared to thoroughly justify such a ‘thoughtless’ term, then you may only say ‘significance level p’ where p is continuous, and never rounded up or cut off, ever. But never, ever use the ‘ant’ ending: signifi*cant*. You can’t, can’t, can’t say results are statistically signifi**cant** (at level p). The only exception would be if you’re giving the history of statistics.”[3]

Why can’t the ASA merely provide a bipartisan forum for discussion of the multitude of models, methods, aims, goals, and philosophies of its members? Wasserstein et al. 2019 admits there is no agreement, and that there might never be. Spare us another document whose implication is: we need not test, and cannot falsify claims, even statistically (since that is the consequence of no thresholds). I realize that Kafadar is calling for a serious statement–one that counters the impression of the Wasserstein et al. opinion.

To address these issues, I hope to establish a working group that will prepare a thoughtful and concise piece reflecting “good statistical practice,” without leaving the impression that p-values and hypothesis tests—and, perhaps by extension as many have inferred, statistical methods generally—have no role in “good statistical practice.” … The ASA should develop—and publicize—a properly endorsed statement on these issues that will guide good practice.

Be careful what you wish for. I give major plaudits to Kafadar for pressing hard to see that alternative views are respected, and to counter the popular but terrible arguments of the form: since these methods are misused, they should be banished, and replaced with methods advocated by group Z (even if the credentials of Z’s methods haven’t been scrutinized!) We have already seen in 2019 the extensive politicization and sensationalizing of bandwagons in statistics. (See my editorial P-value Thresholds: Forfeit at your Peril.) The average ASA member, who doesn’t happen to be a thought leader or member of a politically correct statistical-philosophical tribe, is in great danger of being muffled entirely. There’s already a loss of trust. We already know, under the motto that “a crisis should never be wasted”, that many leaders of statistical tribes view the crisis of replication as an opportunity to sell alternative methods they have long been promoting. Rather than the properly endorsed, truly representative, statement that Kafadar seeks, we may get dictates from those who are quite convinced that they know best: “les stats, c’est moi”.

**APPENDIX. How a Working Group on P-values and Significance Testing Could Work**

I see one way that a working group could actually work. The 2016 ASA statement, ASA I, had a principle–#4–that you don’t hear about in the 2019 follow-up. It is that “P-values and related statistics” cannot be correctly interpreted without knowing how many hypotheses were tested, and how data were specified and results selected for inference. Notice the qualification “and related statistics”. The presumption is that some methods don’t require that information! That information is necessary only if one is out to control the error probabilities associated with an inference.
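Principle 4’s point is easy to check numerically. The sketch below is my own toy illustration (the function name and parameter values are invented, not anything from ASA I): if a researcher hunts through twenty independent true-null hypotheses and reports only the smallest P-value, the chance of at least one nominally “significant” result is about 1 − 0.95^20 ≈ 0.64, not the nominal 0.05.

```python
import random

def min_p_over_searches(n_tests=20, n_sims=10_000, seed=1):
    """Simulate searching through n_tests independent true-null hypotheses
    and reporting only the smallest p-value: how often does that minimum
    fall below the nominal 0.05 threshold?"""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Under a true null hypothesis, each p-value is uniform on (0, 1).
        smallest = min(rng.random() for _ in range(n_tests))
        if smallest < 0.05:
            hits += 1
    return hits / n_sims

# Nominal level 0.05, but selection inflates the actual error rate to ~0.64
print(min_p_over_searches())
```

This is exactly why a P-value (or any error probability) cannot be interpreted without knowing how many hypotheses were searched and how results were selected.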

Here’s my idea: have the group consist of those who work in areas where statistical inferences depend on controlling error probabilities (I call such methods *error statistical*). They would be involved in current uses and developments of statistical significance testing and the much larger (frequentist) error statistical methodology within which it forms just a part. They would be familiar with, and some would be involved in developing, the latest error statistical tools: tests and confidence distributions, P-values with high-dimensional data, current problems of adjusting for multiple testing and of testing statistical model assumptions. They would also be capable of comparing different aspects of statistical methods (Bayesian and error statistical). They would present their findings and recommendations, and responses would be sought.

The need for the kind of forum I’m envisioning is so pressing that it should not be contingent on being created by any outside association. It should emerge spontaneously in 2020. *We take that step here.*

*Please share your comments in the comments.*

[1] This is a pun on “l’état, c’est moi” (“I am the state”, Louis XIV). I thank Glenn Shafer for the appropriate French spelling for my pun. (Thanks to S. Senn for noticing I was missing the X in Louis XIV.)

[2] They are referring to the last section of ASA I on “other measures of evidence”. Indeed, that section suggests an endorsement of an assortment of alternative measures of evidence, including Bayes factors, likelihood ratios and others. There is no attention to whether any of these methods accomplish the key task of the statistical significance test–to distinguish genuine from spurious effects. For a fuller explanation of this last section, please see my posts from June 17, 2019 and November 14, 2019. And, obviously, check the last section of ASA I.

Shortly after the 2019 editorial appeared, I queried Wasserstein as to the relationship between it and ASA I. It was never clarified. I hope now that it will be. At the same time I informed him of what appeared to me to be slips in expressing principles of ASA I, and I offered friendly amendments (see my post from June 17, 2019).

[3] If you’re giving the history of statistics, you can speak of those bad, bad men–dichotomaniacs, Neyman and Pearson–who, following Fisher, divided results into significant and non-significant discrepancies (introduced the alternative hypotheses, type I and II errors, power and optimal tests) and thereby tried to reduce all of statistics to acceptance sampling, engineering, and 5-year plans in Russia–as Fisher (1955) himself said (after the professional break with Neyman in 1935). Never mind that Neyman developed confidence intervals at the same time, 1930. For a full discussion of the history of the Fisher-Neyman (and related) wars, please see my *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (CUP, 2018).

[4] I was just sent this podcast and interview of Ron Wasserstein, so I’m adding it as a footnote. There, Wasserstein et al. 2019 is clearly described as the ASA’s “further guidance”, and Wasserstein takes no exception to it. The interviewer says:

“But it would seem as though Ron’s work has only just begun. The ASA has just published further guidance in the most recent edition of The American Statistician, which is open access and written for non-statisticians. The guidance is intended to go further and argues for an end to the concept of statistical significance and towards a model which the ASA have coined their ATOM Principle: Accept uncertainty, Thoughtful, Open and Modest.”

[5] Nathan Schachtman, in a new post just added to his law blog on this very topic, displays a letter from the ASA acknowledging that a journal has revised its guidelines taking into account *both* ASA I and the 2019 Wasserstein et al. editorial. I had seen this letter, in relation to the NEJM, but it’s hard to know what to make of it. I haven’t seen acknowledgments of other journals, and there have been around seven at this point. I may just be out of the loop.

**Selected blog posts on ASA I and the Wasserstein et al. 2019 editorial:**

- March 25, 2019: “Diary for Statistical War Correspondents on the Latest Ban on Speech.”
- June 17, 2019: “The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)
- July 19, 2019: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i)
- September 19, 2019: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access). The article by Hardwicke and Ioannidis (2019), and the editorials by Gelman and by me are linked on this post.
- November 4, 2019: On some Self-defeating aspects of the ASA’s 2019 recommendations of statistical significance tests
- November 14, 2019: The ASA’s P-value Project: Why it’s Doing More Harm than Good (cont from 11/4/19)
- November 30, 2019: P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)

Nov 27 tweet in response to tweets by Lakens, Senn and Mayo:

“The playground is wide. The discussion is focused on one corner. Statistics is not a stand alone discipline but Statisticians act as if they can make choices. They forget that there are customers who also have a say in this. Part of the problem is the self defeating myopic posture”

Yes, some statisticians claim that they want to be collaborators but act as if “Les stats, c’est moi”. Seems like the ASA is setting the example….

Ron: It would be great if the stat “customers” spoke up and said we’re not buying what you’re selling.

They definitely are. I just sat on the thesis committee of a PhD in mediation models applied to psychology. They view the ASA debate as an anthropological curiosity (my qualifier). Talked also to economists who are simply ignoring it. My clinical and preclinical research colleagues also find the discussion mostly destructive, with no constructive elements.

Your point is a bit cynical. Usually it is up to the service provider to make the effort to figure out what his customers think, want and need.

My note on the theory of applied statistics raised this issue a decade ago and went through 18 versions before I decided it was time to share it. Unfortunately it did not get much attention:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2171179

rkenett:

I guess I don’t have a lot of appreciation for those who choose to be bystanders, ignoring, chuckling or keeping silent about a supposed “anthropological curiosity”. Especially those who have a deeper understanding of the issues and the statistics. Climbing to a higher level, safely focusing on the breezy meta-issues of how statistics is related to theory and substantive research–while important–is just an excuse for not intervening while the foundations of statistical method are made to float on rocky oceans and dangerous waters. I’d like to hear some support for Kafadar, or if you disagree with her call, say why.

Yes – you made this point before. However, typically the commitment of people is limited to their own discipline and, unless seriously invited, one will not step out, as you expect people to do.


To have researchers in psychology, economics, biology, etc. speak up requires creating an adequate opportunity for that; hence my earlier suggestion.

Regarding my perspective:

1. We need to clarify the terminology. I tried to do that in a Nature Methods commentary referring to age-old terms used in industrial statistics. As you pointed out, the report of the National Academies Press ignored that. To my mind, the terminology in the report is used the wrong way. What they did do, however, is bring out the issue of generalizability, which is my point 3 below.

2. Research outcomes need to be presented adequately. The current trend is for journals to require a bullet list of key points at the very beginning of papers. Tal Yarkoni made a similar comment for research in psychology in general. No real methods for doing that have been proposed, except for the use of alternative representations and a table delineating a boundary of meaning (BOM), a suggestion which I have made repeatedly and won’t repeat here.

3. Generalisability is a fundamental issue in science, engineering, the social sciences, healthcare, etc., in making research claims based on research findings. This raises the question of making claims of the wrong sign or the wrong magnitude. I have been using the S- and M-type errors of Gelman and Carlin for that. They have the advantage that clinicians and other domain experts understand their meaning instantly. Generalisation is treated at length in my book with Galit Shmueli on Information Quality.
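For readers unfamiliar with Gelman and Carlin’s Type S (sign) and Type M (magnitude) errors, here is a minimal Monte Carlo sketch of the idea (my own illustration; the function name and parameter values are invented for the example): when the true effect is small relative to the standard error, the estimates that clear the significance bar are badly exaggerated, and a surprising fraction have the wrong sign.

```python
import random
import statistics

def type_s_m(true_effect=0.1, se=1.0, alpha_z=1.96, n_sims=100_000, seed=0):
    """Monte Carlo estimate of Gelman & Carlin's Type S and Type M errors:
    among estimates that reach 'significance' (|estimate| / se > alpha_z),
    how often is the sign wrong, and by how much is the size exaggerated?"""
    rng = random.Random(seed)
    wrong_sign = 0
    magnitudes = []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)   # noisy estimate of the effect
        if abs(est) / se > alpha_z:        # a "significant" result
            magnitudes.append(abs(est))
            if est * true_effect < 0:
                wrong_sign += 1
    type_s = wrong_sign / len(magnitudes)                     # P(wrong sign | significant)
    type_m = statistics.mean(magnitudes) / abs(true_effect)   # exaggeration ratio
    return type_s, type_m

s, m = type_s_m()
print(f"Type S: {s:.2f}, Type M (exaggeration): {m:.1f}x")
```

With these illustrative numbers (true effect one-tenth of the standard error), roughly a third of the “significant” estimates have the wrong sign, and the significant estimates exaggerate the true magnitude many times over.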

4. This is the main point. Statistics is not alone in the analytic landscape. AI/ML, for example, offers methods to validate and generalise models. They do that without considering the ASA blurbs.

The bottom line is a refocusing of a quote related to official statistics:

“An issue that can lead to misconception is that many of the concepts used in official statistics often have specific meanings which are based on, but not identical to, their everyday usage meaning. Official statistics “need to be used to be useful” and utility is one of the overarching concepts in official statistics.” Forbes, S. and Brown, D. (2012) Conceptual thinking in national statistics offices, Statistical Journal of the IAOS 28, p 89–98.

By analogy, the ASA recommendations and related publications “need to be used to be useful”.

Statistics has had great impact on science in the past; it must remain so in the future.

ron

rkenett: I don’t “expect” most people to step out from the limits of what they regard as their own discipline–who would expect such scary generalizability (the term you love) unless “seriously invited” to do so? It’s simply a consequence of being unable or unwilling to actually ponder the issues themselves, and get beyond a status quo of timidity and confusion. And by “seriously invited”, I spoze you mean the path has already been made safe for them to do so, which means they would no longer be providing any insights that could be regarded as “stepping out”.

They definitely are. I just sat on the thesis committee (une soutenance, a thesis defense) of a PhD thesis on mediation models applied to psychology. Psy researchers I met there view the ASA debate as an anthropological curiosity (my qualifier). Talked also to economists who are simply ignoring it. My clinical and preclinical research colleagues also find the discussion mostly destructive, with few constructive elements.

Your point is, however, a bit cynical. Usually it is up to the service provider to make the effort to figure out what his customers think, want and need. Statisticians who position themselves as almighty gatekeepers will not understand this.

The bottom line is that, instead of “Les stats, c’est moi” we should have “Venez discuter avec nous les stats” (“come discuss stats with us”). To be fair, some statisticians do that.


Convincing the customers by amassing 800 signatories to a petition in Nature does not seem to work.

At the Washington SSI conf, the organisers had a few “customers” speak (very few). To me they had very insightful messages to convey.

The ASA could, for example, facilitate such discussions by providing a platform for alternative views. It could be done in the tradition of a British debate, with a motion discussed by two opponents and seconders.

Mayo: in a 1997 review of a stat book (Experiments in Ecology, by A.J. Underwood) I noted confusion resulting from his conflation of the testing of research hypotheses and the testing of statistical hypotheses. Like you, he thinks Popper to be relevant to statistical analysis. I think this lies at the core of our differences. That confusion was treated at greater length in Hurlbert & Lombardi (2009). Here’s an excerpt from the book review:

‘Underwood advocates a logical framework that he claims is ‘well-used and of long-standing’ and ‘in widespread use in ecology’ (p. 4). We should hope this is not true. The framework combines the falsification procedure of Karl Popper and the decision-theoretic framework of Jerzy Neyman and Egon Pearson in a way unlikely to have been acceptable to any of these fellows. And the framework is completely antithetical to R.A. Fisher’s concept of significance testing.

“In the typical situation, when the test of a null hypothesis (H0) yields a low P value, we all agree that one has reasonable grounds for rejecting H0 and has ‘something to talk about’. On the other hand, if the test yields a high P value, we have little to talk about. In particular, the high P value is not evidence in favor of either H0 or the alternative hypothesis (HA). A high P value is a recommendation only for indecision with respect to the truth of H0. This is elementary.

“But in Underwood’s ‘logical framework’, a high P value indicates that H0 should be ‘retained’ and that HA is ‘clearly wrong’, ‘disproven’, and ‘falsified’ (p. 17).

“The fundamental difficulty seems to be Underwood’s belief that it is possible and desirable to conflate two distinct logical frameworks into one: a general framework for testing of research hypotheses, and the narrow framework of significance testing by which particular data sets bearing on a research hypothesis are evaluated, one by one.”

Full review at https://www.amazon.com/Experiments-Ecology-Interpretation-Analysis-Variance/dp/0521556961#customerReviews

Stuart: I honestly don’t see what your remark about the distinction between statistical and substantive inference has to do with the issue at hand at all. It is interesting, however, that you write a comment on this post, given your activism in getting journal editors to revise their guidelines so as to ban the use of the word “significant”. Unlike Wasserstein et al. 2019, you presumably would retain “significance”, but it’s too late. They have gone whole hog–at least those in the P-value Project II.

But my point does not concern the distinction between statistical and substantive significance. It concerns the distinction between statistical hypotheses and research hypotheses. See Hurlbert & Lombardi (2009), pp. 334-338. Too long to quote here.

That’s exactly how I understand the difference. And with that semantical clarification, what does your point have to do with the issues in my post?

Seems to me that if rkenett’s comment is relevant to the issues in your post then so is Stuart’s. They both have to do with concern about the distinction between the preferences of statisticians (mistaken and silly in the case of Underwood) and the needs and practices of the scientists who might like to use statistics.

Michael: I took Ron to be saying that customers might not agree with the “les stats, c’est moi” attitude of certain stat administrators or whatever. But there is still no real detail in his comment as to why. So I concur that neither weighs in on the issue at hand.

I replied above without seeing this exchange. Actually I replied twice, with the first version seemingly rejected because it includes a link I posted before…..

Read pp. 334-338, as they relate to many of your posts, and your book.

A statistical hypothesis might be that in this particular set of patients drug A improves 1-year survival rates. A research (or scientific) hypothesis might be that for all patients with this condition, drug A improves survival with no or minimal negative side effects.

Research hypotheses are typically tested via a research program involving both observational (e.g. epidemiological) and experimental studies using different types of patients, multiple response variables, a variety of statistical analyses, etc.

Stuart: Still at a loss to connect your comment to my post.

Stuart – irrespective of the specific post thread, I have been making the distinction you are making by dichotomising the “here and now” and providing a “forward looking outlook”.

a. Here and now: you posed a question, designed a study, collected and analysed data. You correct for selective inference, report FDR, or Bayes factors, conduct a sensitivity analysis a la Saltelli or Cornfield, and formulate research claims verbally (the verbal key points mentioned earlier). What Mayo mentioned is my proposal that, in the process, you state what you found with alternative representations and provide statements that appear similar but have a different meaning, so that they are not part of your research claims (i.e. what you did not find).
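For concreteness, the “report FDR” step in (a) could be as simple as the Benjamini-Hochberg step-up procedure. The sketch below is a generic textbook version of that procedure (my own illustration, with made-up p-values), not anything specific to the proposals in this thread:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: sort the m p-values, find the
    largest rank k with p_(k) <= (k/m) * q, and reject the hypotheses with
    the k smallest p-values, controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    threshold_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            threshold_rank = rank       # largest rank satisfying the bound
    rejected = set(order[:threshold_rank])
    return [i in rejected for i in range(m)]

# Illustrative p-values from six tests: only the first two survive at q = 0.05
ps = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(ps))  # [True, True, False, False, False, False]
```

Note that 0.039 and 0.041, “significant” one at a time, are not rejected once the multiplicity of the six tests is taken into account.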

b. Forward looking: you generalise your findings and provide a plan for follow-up studies that will confirm (or not) your research claims. This path will enhance (or not) the strength of the claims.

The distinction between what we have now and what we plan/need/want to do next, and the issue of how we represent research claims, seems missing from the overall p-value and significance discussion. There is a need for a methodology for doing that, beyond the printouts that come out of R, Python or other calculation platforms. This would also address the remark of Philip Stark.

Presumably these could be included in the group discussions proposed by Karen and Mayo.

In a different context, I had a discussion with Karen on my recent book with Tom Redman titled The Real Work of Data Science. This is about the role of data scientists in organisations. It is a sort of updated version of what Deming suggested as a leader in statistical methods in organisations. From that, I did not get a sense that the ASA was able/interested/willing to provide a platform for discussing this. In any case, my prior on the effectiveness of the group discussions suggested above is pessimistic. Where is the current-day Tukey, or Deming? He or she needs to speak out….

rkenett: Absurd to spoze we do everything at once in stat. It’s those seeking a single measure, be it a confidence index, degree of belief, betting assessment, or the like, who are radically oversimplifying statistical inference in science. The assessment of the statistical significance of a difference is a small part of a rich methodology that requires putting together vast numbers of pieces to form strong arguments from coincidence, ensure error control and severe tests. It’s ridiculous in the extreme that a statement of statistical significance is being declared “thoughtless” by Wasserstein et al. 2019 on grounds that it’s not taking account of everything we know at the moment it is computed. Good science is piecemeal.

Mayo – Your next blog should be titled: “Absurd to spoze we do everything at once in stat”.

There should be a course on statistics strategy. Box recommended you spend about 30% of your experimental budget on a first set of experiments and then design the next ones. This is called sequential experimentation. Box also mapped the inductive -> deductive -> inductive chain. My work with Galit Shmueli on Information Quality is also in the statistics strategy box and, of course, so is your work on severe testing.

Perhaps a smaller-scope suggestion, with more chance of producing something useful, is to propose that the ASA form a team for designing such a course. This might work.

rkenett:

Take a look at my first post on the Wasserstein et al. 2019 update, where I’m giving “friendly amendments”, not adopted.

https://errorstatistics.com/2019/06/17/the-2019-asa-guide-to-p-values-and-statistical-significance-dont-say-what-you-dont-mean-some-recommendations/

Within my third item, I also cite my book SIST, p. 162:

“This leads to my third bulleted item from ASA II:

(3) A declaration of statistical significance is the antithesis of thoughtfulness… it ignores what previous studies have contributed to our knowledge. (p. 4)

Surely the authors do not mean to say that anyone who asserts the observed difference is statistically significant at level p has her hands tied and invariably ignores all previous studies, background information and theories in planning and reaching conclusions, decisions, proposed solutions to problems. I’m totally on board with the importance of backgrounds, and multiple steps relating data to scientific claims and problems. Here’s what I say in SIST:

The error statistician begins with a substantive problem or question. She jumps in and out of piecemeal statistical tests, both formal and quasi-formal. The pieces are integrated in building up arguments from coincidence, informing background theory, self-correcting via blatant deceptions, in an iterative movement. The inference is qualified by using error probabilities to determine not “how probable,” but rather, “how well-probed” claims are, and what has been poorly probed. (SIST, p. 162)

But good inquiry is piecemeal: There is no reason to suppose one does everything at once in inquiry, and it seems clear from the ASA II guide that the authors agree. Since I don’t think they literally mean (3), why say it?

Practitioners who use these methods in medicine and elsewhere have detailed protocols for how background knowledge is employed in designing, running, and interpreting tests. When medical researchers specify primary outcomes, for just one example, it’s very explicitly with due regard for the mechanism of drug action. It’s intended as the most direct way to pick up on the drug’s mechanism. Finding incompatibility using P-values inherits the meaning already attached to a sensible test hypothesis. That valid P-values require context is presupposed by the very important Principle 4 of ASA I (see note [3]).

……

Now it is possible the authors are saying a reported P-value can never be thoughtful because thoughtfulness requires that a statistical measure, at any stage of probing, incorporate everything we know (SIST dubs this “big picture” inference.) Do we want that? Or maybe (3) is their way of saying a statistical measure must incorporate background beliefs in the manner of Bayesian degree-of-belief (?) priors. Many would beg to differ, including some leading Bayesians. Andrew Gelman (2012) has suggested that ‘Bayesians Want Everybody Else to be Non-Bayesian’:

Yes – SIST should be a textbook for a workshop on statistics strategy. Did you propose this to the ASA?

What your blog got me thinking is that we have two perpendicular directions.

The horizontal one is the life cycle view of statistics. This was my 2015 Hunter conference keynote.

The vertical axis is the strategic dimension. Your comments and suggestions are mostly directed at this. The information quality framework is part of that too.

Covering both axes could be the scope of such a workshop.

I agree that the discussions regarding potential reform of statistical practices have gotten out of hand. They have attempted to move too fast with too little firm understanding of the real-world role of statistics and with insufficient regard to the variety of needs of statistical users (rkenett’s comment is apposite). The multitude of voices with wildly varying suggestions makes it impossible for anyone who is less than obsessed to make heads or tails of the debate. (Happily, some of us _are_ obsessed!)

The synthesis of statistical procedures into scientific inference has to be conditioned on the needs and desires of the particular scientists involved, and the language used to explain and summarise is often field-specific. As I am a basic pharmacologist obsessed with these issues, I have put together a guide to the inferential meanings and use of p-values for basic pharmacologists (https://arxiv.org/abs/1910.02042v1). It could form a useful framework and template for any further attempts to bring clarity to this confused topic.

Michael: I will look at your link. I’m still struggling to find the clear point you wish to make. You’re saying you can make heads or tails of….what? because you are obsessed with these (?) issues. So are you agreeing with Kafadar? disagreeing?

Michael: I meant to add that I’m intrigued that you think statistical reforms have “gotten out of hand” and “moved too fast”. I don’t know if this is a change of view on your part.

Yes, I agree with you that what you call ASA II is inappropriate, misconceived and ill-worded. All of your suggested edits are sensible and each would improve the document. However, it would be better if the document were withdrawn.

The process has gotten out of hand because, for example, the battle against p-values is led by many who think p-values are something to do with automatic decision procedures. They do not understand the actual nature of p-values and their helpful role in some inferences. Too many ill-informed voices are making noise.

I was a participant in the drafting of ASA I and was well satisfied with what we were able to put together. The document was not without flaws, and I would have changed some things, but I think it was more helpful than not and was much better than what the pre-meeting discussions had led me to expect we would be able to achieve.

I attended the ASA symposium called something like “Moving beyond p<0.05" and, while enjoying it and finding lots of interesting ideas, I was quickly convinced that it was not going to lead to any further clarity: too many of the participants have exactly the muddled thinking that you regularly deride in your blog. There is a need for more self-reflection and thoughtful re-evaluation of publicly stated positions.

I declined an invitation to supply a paper for the special edition of The American Statistician because I expected it to be little more than a cacophony of mistaken and conflicted advice. (I feel vindicated in that expectation.)

Instead of putting together a short paper for that issue, I took advantage of having been invited to supply a chapter about statistics for the series "Handbook of Experimental Pharmacology" to put together a _long_ paper containing my attempt to synthesise ideas about how statistical inferential processes should be incorporated into scientific inference. That is the chapter that I linked in a comment above.

Yes, there have been some important changes in my thinking over the years as a consequence of improved understanding. However, I do not think that the changes have been in the direction that you would prefer. My chapter (alright, I'll link it again: https://arxiv.org/abs/1910.02042v1) is up to date on my thoughts and advice, as far as it goes.

Michael:

I read your paper; I read the last portion quickly, but I see your points. I agree with nearly everything with the exception of the philosophy and history of Fisher and N-P statistics. In the past 15 years, since my work with David Cox and Aris Spanos, I read Fisher more closely, and reread or read Neyman, Pearson, Neyman and Pearson and much else related to what really happened. Most important, I developed my reformulation of an error statistical philosophy of science by solving problems enabling me to push much further my early ideas of a non-behavioristic, “evidential” use of error probability assessments. As interesting as is the history of F-N-P, I frankly think it’s crazy to allow our contemporary use and interpretation of these statistical methods to be “limited by what someone 50, 60, or 90 years ago thought, or to what today’s discussants think they thought.” (SIST Preface, xiii). My view is “it’s the methods, stupid”. If we got over that limitation, you and I would scarcely disagree on anything.

Readers might look at “Deconstructing the N-P versus Fisher Debates”

Where we really agree is when you say that Wasserstein et al. 2019 (ASA II) “is inappropriate, misconceived and ill-worded. All of your suggested edits are sensible and each would improve the document. However, it would be better if the document was withdrawn” (Lew comment).

But how can it be withdrawn? And would you also withdraw the proposal to stop saying “significant/significance”? I’d be glad to have a guest post from you on this. I thought you were one of the signers to the associated Amrhein et al. paper. I return to this below:

You wrote: “I declined an invitation to supply a paper for the special edition of The American Statistician because I expected it to be little more than a cacophony of mistaken and conflicted advice. (I feel vindicated in that expectation.)”

You were right. In my case it was just that I was too busy finishing my book and then, even after the deadline was extended, a paper. I’m glad I didn’t. I would have felt tricked into being part of the “against statistical significance” campaign, unless authors were notified it would be wrapped up in this effort.

Now to return to your idea of Wasserstein et al. 2019 being withdrawn in some way, maybe Kafadar’s task force could work on sifting from the document those parts to be salvaged? I describe in my appendix to this post the main way I would see such a task force working constructively.

I did not put my name to the Amrhein paper, and I do not entirely agree with the call to stop saying “significant”.

The problem with saying something is ‘statistically significant’ is that such a statement leaves out almost everything of value that the data might say and that a scientist might learn from an experiment. I do not want anyone to think that I therefore think it a good idea to drop p-values from the statistical toolbox.

People should read my chapter for a detailed account of my opinions. It took me 30 pages to explain them and so it is not a good idea to attempt to make them clear in a comment.

Michael: I didn’t ask you to explain your account of statistical inference, I only asked how you thought Wasserstein et al. 2019 (which I have often abbreviated as ASA II) could be withdrawn. It was an interesting suggestion that no one has made before. I wondered if you seriously contemplated it (even if only hypothetically), or were just expressing your unhappiness with it. If the former, specific ideas could be relevant to the new working group.

Michael,

I DID sign onto the Amrhein et al. paper, but only because ONE PART of it, and hence 800+ people, did agree with the ONE concrete idea that has wide support: that of disallowing use of the phrase “statistically significant” and, hence, claims of “statistical significance.”

Our “Coup de grace” paper was completely supportive of the continued use of P-values in the framework of “neoFisherian significance assessment” as you know.

You may not be “entirely in agreement” with that, but I’ll settle for 99%!

Do you really think ASA II — just a paper after all — should be “withdrawn,” i.e. retracted?

If that is the case, I would suggest we should do the same for > 50% of the scientific literature!

No point in blogging about this. If a serious case can be made for that, do it in a paper in TAS.

Stuart: If you read the comment, it wasn’t I who suggested that, it was Michael Lew. I wondered if he was serious. I only made those friendly amendments back in June, and recommended greater clarity regarding its relation to the ASA P-value guidelines.

Of course, Lew’s recommendation entails no such thing about 50% of the scientific literature–where did that come from?

Stuart, I think that the ASA II is a special case. It is the first article in the special edition of TAS, and its first author is an ASA office-holder. Most readers will assume it carries the same imprimatur as ASA I, which was a (largely) consensus document from an extensive discussion and debate among a large expert panel. The limitations of ASA I are in some ways its strengths.

If the ASA II document is misleading in its authority, misleading in its wording, and unhelpful in its advice, then withdrawal might be appropriate.

I do not think that the “don’t say significant” movement will be helpful, even though I say it to my colleagues. The message has to be much more nuanced and complete (as in your papers or mine), and the result of “don’t say” will likely be the adoption of non-p-value statistical approaches to mindless inference. Hardly an advance, in my opinion.

Michael: I completely agree with all you say in this comment. Clearly, the ambiguity wasn’t some mistake, but rather a way to have one’s cake and eat it too–at least for some people. Others go hungry. I’ve seen articles describing WSS as “writing for the ASA”.

New Note #4. Although Wasserstein has objected to construing Wasserstein et al. 2019 as “further guidance” from the ASA, he endorses (or at least does not correct) that view when interviewed. I can give numerous other examples.

[4] I was just sent this podcast and interview of Ron Wasserstein, so I’m adding it as a footnote to my post. There, Wasserstein et al. 2019 is clearly described as the ASA’s “further guidance”, and Wasserstein takes no exception to it. The interviewer says:

“But it would seem as though Ron’s work has only just begun. The ASA has just published further guidance in the most recent edition of The American Statistician, which is open access and written for non-statisticians. The guidance is intended to go further and argues for an end to the concept of statistical significance and towards a model which the ASA have coined their ATOM Principle: Accept uncertainty, Thoughtful, Open and Modest.”

Mayo, I’m glad to have Ron’s support for the key message in our paper “Coup de grace for a tough old bull” but I don’t see much point in worrying in a blog about the influence of one person (Ron) or one journal (ASA). The only persons who will change things in a big way are those who are writing (or revising) the most popular introductory statistics texts to reflect the modest neoFisherian position.

Jessica Utts may be doing that for one of her texts. You might invite her into the discussion, as you start a draft of your own intro stats book.

I haven’t begun to read seriously — let alone respond to — the new papers by paleoFisherian advocates, because I’m fighting equally lonely battles of greater societal significance. One is censorship within academia of discussions of population and immigration issues, about which I wrote you privately earlier.

Another is the identity politics in academia, a prime example of which is my own university setting aside certain faculty positions in our College of Sciences and College of Engineering that will be open only to persons who are neither white males nor Asian males, despite that being illegal in California. As you will be aware, that is of high relevance to the growing numbers of Asian male statisticians in the U.S.

Yes, I admit to not being very good at staying on topic!

Stuart:

The ASA P-value project II is having strong consequences. As you must know, around 8 journals have revised their guidelines taking both ASA I and the Wasserstein et al. 2019 editorial into account. Some adopt the vsSS posture in doing so. Fisher’s concept is killed off because of a small group. What you say about textbooks may be right; I don’t know.

I wouldn’t dream of writing an intro stat textbook.

Two simple points in support of Karen Kafadar:

1. It should be made very clear where Ron Wasserstein is speaking for himself and where something is ASA policy.

2. I’ve done simple counting of the number of questions at issue in over 50 applied papers and the median number of questions at issue is ~10,000. See also **. Any p-value under those conditions has no meaning. Crudely put, researchers are gaming the scientific system.

I think we need to support Karen. Understand/figure out the system. If there is gaming, that is what needs to be fixed first.

**Head ML, Holman L, Lanfear R, Kahn AT, Jennions, MD. 2015. The extent and consequences of p-hacking in science. PLoS Biol. 13(3):e1002106. doi:10.1371/journal.pbio.1002106.
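Stan’s multiplicity point can be made concrete with a back-of-envelope calculation. The sketch below is my illustration (not from Head et al.), assuming m independent tests of true null hypotheses at alpha = 0.05; at ~10,000 questions at issue, a nominally “significant” result is all but guaranteed by chance alone:

```python
# Illustrative only: probability of at least one p < alpha among m
# independent tests of true nulls, plus the expected number of false
# positives. The m values are hypothetical.
alpha = 0.05

for m in (1, 10, 100, 10_000):
    p_any = 1 - (1 - alpha) ** m        # P(at least one "significant" result)
    expected_fp = m * alpha             # expected count of false positives
    print(f"m={m:>6}: P(any) = {p_any:.4f}, expected false positives = {expected_fp:g}")
```

With m = 10 the chance of at least one nominally significant result is already about 0.40; at m = 10,000 it is indistinguishable from 1, with roughly 500 expected false positives, which is the sense in which an unadjusted p-value “has no meaning” there.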

Hi Stan!

On point 1, you can see in my note 4, and throughout the post, there is no real distinction under the “les stats, c’est moi” mindset that we see at ASA. As for your point in 2, what do you do: try to adjust for selection, or simply view such explorations as providing hypotheses to test with distinct data? What I fail to see is how such data dredging is blocked by accounts, like Bayes factors, that question adjusting for multiplicity. The question is whether the problem should be visible, by leading to an illicit p-value that we can call out (because revealing the full procedure is mandatory), or at most an exploration, or whether to embrace tools that are prepared to make inferences regardless of dredging, hoping perhaps to rely on some prior.

Researchers will always game whatever scientific system there is. And if some field has some particular “culture”, e.g. has developed some irrational religious attitudes to p-values and significance, people will learn what they have to do in order to get published and to further academic careers. That’s much easier than learning how to use statistics in a responsible way. And doing it in a responsible way will certainly rapidly destroy your career progress.

A new system will be as much abused as an old system.

Secondly I want to add that the Bayesian approach, and the frequentist approach, are both *models* of how to process and learn from statistical data. All models are wrong, some are useful. I think that both are, in general, gross oversimplifications. Life is hard and many parties are involved. Bayes tells me what *I* should believe. He doesn’t tell me what to *do* unless I start specifying the different actions I could take and the different costs which each action would lead to, in each possible state of the world. Even then he doesn’t tell me what to do. Why should I minimise my *expected* costs? Why should I be able to quantify them? What about the costs of computation? Neyman-Pearson is about a two-person game. Statistics is used in science and in politics, many parties are involved, all with different interests and different knowledge and different responsibility.

I want to echo Stan’s comments and your (and his) praise of Karen. This topic has become unfortunately politicized, to the disadvantage of science.

Personally, I believe that our community is doing real harm by pushing against p-values instead of pushing against the *misuse* of p-values, tests of significance, confidence intervals, and other mindless “cargo-cult” practices that use statistics as an incantation instead of as a way to keep from fooling ourselves.

For the ASA as a body to denigrate p-values is like a scholarly society of mathematicians denigrating addition because people make arithmetic errors. Teach arithmetic better; don’t throw away a useful tool.

Philip:

Thank you so much for your comment! It’s so great to know that not all statisticians have gone mad and decided to “rise up against significance tests” as the Nature article (March 2019) sensationally declared. We ought to be able to get at least 1000 signatures endorsing your comment (not that I plan to seek them)!

I would sign it. Some seem to want to replace arithmetic with reading tea leaves.

John: True, they have examined the alt methods being put forward as replacements as much as they’ve tested tea leaves: you can see in the data whatever you want.

Pingback: Schachtman Law » American Statistical Association – Consensus versus Personal Opinion

Mayo, Thanks for this post. As you note, I have been following this issue closely because of the play that professional organization consensus statements get in court proceedings. Statistical reasoning is difficult enough for most people, but the hermeneutics of American Statistical Association publications on statistical significance may require a doctorate of divinity degree.

When the “special” issue of the American Statistician came out earlier this year, I read the Wasserstein editorial as much more than personal opinion. In my post at , I detailed why and how I interpreted the Wasserstein editorial, and how relieved I was to read President Kafadar’s forthright column in this month’s AmStat News, that the Wasserstein article lacks any imprimatur of the ASA. I just modified my post to reference your helpful discussion above.

Officers of the ASA are entitled to their opinions and the opportunity to present them, but disclaimers would bring clarity and transparency to published work of these officials when they are not writing ex cathedra. I agree with your call for greater discipline by authors and editors in policing disclaimer practices.

Nathan Schachtman

Nathan: Thank you for your comment and excellent legal posts on this topic. It will be interesting to solve the mystery of the ASA letterhead announcement. My current suspicion is that it may have been sent just to authors of the papers in the special March 2019 issue, telling them about the journals that have reacted. Do people think that makes sense? I haven’t asked anyone in that reference set.

Mayo,

Thanks for the kind words. I am not sure about the “ASA letterhead announcement.” Your suspicion may be right, but mine was that it was a wider mass email. The “From” field of the email was “From: ASA .” To me that does not look like an email that was sent to a limited group. The email I displayed did not have a signatory, and it was set up with graphics, including the ASA logo and banner, that made it look, to me at least, like a mass emailing. Same with the html formatting and link buttons for the NEJM editorial and the new NEJM guidelines. Too much art stuff there for just an email sent to a limited in-group.

All I had was circumstantial evidence, until I just ran a search on the email address above. One of the “hits” I got from the search was an ASA news archive of all the news emails that it sends out over this address: https://ww2.amstat.org/newsletters/index.cfm?NewsletterID=269

This link will take you to the page with the email in question.

Not sure, but there may be a Facebook page with these news items as well.

Nathan

Any sort of science that turns responsibility over to “task forces,” committees, mass petitions, and popularity contests is, deservedly, dead in the water.

Stuart: Interesting. So is it your view that the ASA ought not to have task forces giving guidelines on proper/improper uses of methods? My inclination, from the start, was no, it ought not. I’ve not seen this, but then again, I’m in philosophy of science. My inclination wrt statistics, at this time, is no, it definitely ought not, given all the reasons I have in my last 4 blogposts.

But I find it curious that you say this, given that you have promoted, in the strongest of terms, the urgency of getting journal editors and others aboard the abandon-SS campaign.

So I don’t understand. Is that a change of view for you? I hope that it is, and that others come to see it. Another changed view is your calling for an end to “significance” as well. In your papers, and even recent comments, you say you didn’t suggest this.

To see what many of the signers were opposing, I think it helps to look at the Gelman blog where it was advocated. It comes across as advocating abandoning statistical significance testing altogether, not just words or even thresholds (between the rather warranted and terrible).

So I have no trouble believing the popularity of the bandwagon, but don’t forget, the other side didn’t have a chance to vote.

That’s an order of magnitude estimate of the percentage of papers with serious statistical problems.

Stuart: So there was no need for the truth of the antecedent. But anyway, I know that Lew didn’t have in mind retracting an introduction to a set of papers. It’s the haphazard statements of the principles from ASA I, and/or the new rule: don’t say significant/significance, and don’t use P-value thresholds in interpreting data–a terribly confused assertion–that I’m guessing he is referring to.

Mayo, I try to be very clear in what I say.

Thus I have never argued for an “abandon the SS campaign.”

I have no problem with the word “significance” in phrases such as “significance tests” (or, better, “significance assessment”) or “significance level” (as a poor synonym for P-value), but as soon as you start talking about “statistically significant” or “significant effect” the implication is that an alpha or critical P value has been defined and that you will interpret, e.g., P values of 0.045 and 0.055 differently. My position on this point has not changed in four decades. Gelman is a newcomer to this discussion from my point of view; he’ll have to take responsibility for his own statements.

For better or worse, the only folks who will determine what is done, in at least the medium and short term, will be the authors of stat books.

“…P value has been defined and that you will interpret, e.g. P values of 0.045 and 0.055 differently.”

A few points here

1. ideally you have determined alpha based on cost of making a Type 1 error and sample size

2. you are doing this (making a distinction), but also allowing for errors of alpha and beta

3. Is there not some number of heads for which, when flipping a coin n times, you start to think “Hey, this coin is probably not fair!”? I.e., surely p-values of .001 and .65 are interpreted differently, so again, this raises the question of at what distance you treat them differently. This is not much different from establishing alpha in the first place.

Justin
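Justin’s coin question can be made concrete. The sketch below is my illustration (the particular counts, 60 and 52 heads in 100 flips, are hypothetical); it computes the one-sided binomial p-value P(X ≥ k) for k heads in n flips of a fair coin:

```python
from math import comb

def binom_p_upper(k: int, n: int, p: float = 0.5) -> float:
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 60 heads in 100 flips: p ~ 0.028 -- most people start doubting fairness.
# 52 heads in 100 flips: p ~ 0.38  -- entirely unremarkable.
print(binom_p_upper(60, 100))
print(binom_p_upper(52, 100))
```

Somewhere between those two counts each of us starts to think “this coin is probably not fair”, which is Justin’s point: some distance along the p-value scale does get treated differently, whether or not a formal alpha was ever declared.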

Justin, you optimistically write “ideally you have determined alpha based on cost of making a Type 1 error and sample size”. Yes, ideally, but only in the case where you want to make a rules-based decision on the basis of a single designed experiment. In practice almost no-one in basic pharmacology (my area) gives the loss function any consideration and almost no-one performs the necessary power assessment when designing the experiment. That means that the approach you describe is irrelevant to actual practice. (For the arguments about why it should rarely be used you should see Stuart’s or my papers, linked elsewhere in this discussion.)

Michael: Even if one specifies their test without the appropriate planning in relation to interpretation, anyone can criticize which population effects or discrepancies are well and poorly warranted, making use of the specs actually used. So, for example, if someone announces the data indicate no relevant increased risks, and I can show (using CIs, power analysis, or SEV) that the data fail to rule out risk increases as large as d, which count as risks of concern (referring now to a regulatory mandate, say), then I can show the interpretation of the stat result is unwarranted and flawed. Regulatory documents do operate with thresholds, and we can see whether purported inferences warrant claiming to be in compliance.

In short, knowing the properties of the test enables critically scrutinizing the results, and holding experts accountable for misinterpreting their results (given the regulatory threshold). Without those, the strongest criticisms cannot even be voiced. Steven Goodman recommends the researcher report their confidence index (how strongly they believe in the increased risk, say). How do the consumers of such reports begin to critically examine his index? We’d have to wait for the harms to manifest themselves, and he can always say: “risks are uncertain”.

Stuart: I would always put “statistical” in front, so as to talk of statistical significance tests, levels, etc.

Suppose, as you advise, we do not treat P-values of 0.045 and 0.055 differently, and likewise we do not treat P-values of 0.035 and 0.045 differently, and likewise we do not treat P-values of 0.025 and 0.035 differently, etc. Then what you have is the classic N-P “critical region” wherein all P-values ≤ .055 are treated the same way, and not differently. Then they have an “undecided region”, until, say, P-values of .25 to .5 are all treated the same, as no evidence against the null. These are one-sided P-values. So you seem to be favoring the classic N-P test after all!

N-P introduced power analysis to get more fine-grained distinctions, e.g., evidence that the discrepancy is less than values the test had a high probability of detecting, and to move away from the “classic” test. As N-P point out, they’re using the exact same reasoning as stat sign tests in setting upper bounds via power analysis. And of course Neyman also gave us confidence interval estimators that are dual to tests, but once again use the identical reasoning.

Mayo, your response to Stuart’s point about p-values of 0.045 and 0.055 is unhelpful. Outside of a thoughtfully designed experiment with carefully specified sample size, considered alpha and acceptable beta, a p-value of 0.045 is effectively equivalent to a p-value of 0.055. (And 0.025 is equivalent to 0.035.) And, given that the alpha is almost never a considered level and the power is almost never determined, the p-values are almost always equivalent. (And they all represent pretty weak evidence against the null.)

Anyone who is wondering how a 20 or 30% difference in p-value might make a negligible difference to the evidential interpretation can consult section 2.4 in my paper (https://arxiv.org/abs/1910.02042v1).
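For readers who want a quick feel for why a 20-30% difference in p-value can be evidentially negligible (Lew's section 2.4 gives the full account), a two-sided p-value can be mapped back to the z-statistic that produced it. This is a minimal sketch of my own, assuming a two-sided test of a normally distributed test statistic:

```python
from statistics import NormalDist

def z_from_two_sided_p(p: float) -> float:
    """z-statistic corresponding to a two-sided normal-test p-value."""
    return NormalDist().inv_cdf(1 - p / 2)

# p = 0.045 -> z ~ 2.00; p = 0.055 -> z ~ 1.92: nearly the same signal.
print(z_from_two_sided_p(0.045))
print(z_from_two_sided_p(0.055))
```

The two p-values differ by about 22%, yet the underlying test statistics differ by only about 4%, so any evidential measure that is smooth in z will rate them almost identically.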

Michael: Sure. That’s what the classic N-P test is designed for. For those contexts, they recommend treating the observations (in a specified group) the same. But I take it you don’t want to claim (by an argument from the continuum) that all outcomes are to be treated the same.

The automobile is by far the most (mis)used form of transportation. Nearly 1.25 million people die annually worldwide in road crashes each year. If you own a car, sell it immediately to an authorized dealership and only rely on professional drivers and public transportation. If you do happen to drive (strongly ill-advised), ignore any bright lines you see on the roads and embrace the uncertainty of the journey, using all the displays on your dashboard and phone as a guide. (We also have several new apps you can buy.) Furthermore, do not, under any circumstances, utter the words “safe driving”. This is a thoughtless phrase that should be abandoned, as attested to by thousands of signatures.

Russ: Indeed, they are “unsafe at any speed”, as that old book by Ralph Nader had it. My April Fools’ Day blog (coming soon after the March 20 editorial) did a semi-serious spoof on banning stat words whose English meanings differ from their technical stat meanings.

Thanks, Russ. I think you have written an amusing and apt analogy.

I think that such a forum would be useful, and that participants should include, in addition to statisticians and philosophers of science, leading representatives of the disciplines that use the statistics, e.g. physics, biology, medicine, psychology, sociology. Different fields have different challenges. For example, it is harder to control for hidden variables in psychology than in physics. So the application of p-values & significance may be different, depending on the discipline and its ability to deliver the rigor required by potential ASA positions. Another aspect of the forum could be case studies in the different fields in which significant errors were made that were corrected in subsequent research, e.g. radical mastectomy in medicine. Another aspect could include current controversies that make a difference to many people, e.g. mammography and the US Preventive Services Task Force recommendations. As a retired family physician, I am especially interested in statistics as applied to medicine, although I have virtually no formal training in statistics.

William: Thanks for your comment. I think you are right that the forum or task force or whatever should include representatives from different fields who are familiar with the consequences of statistical recommendations. Your idea to include representatives of case studies in how mistakes and biases were (or were not) corrected is ingenious. (I will recommend it to Kafadar.) That would actually be far more important than what anyone is saying about ATOM (accept uncertainty, be thoughtful, open, modest). What methods come to the rescue in detecting mistakes–that’s what needs to be focused on. One that comes to mind is how the RCTs on hormone replacement therapy in 2002 (accompanied with statistical significance tests) corrected decades of observational studies purporting to show the benefits of HRT on age-related diseases in women.

I would suggest that it is very rare to have an objective way to estimate “the cost of making a type 1 error”. Easy enough to do arbitrarily of course.

In the context of getting a paper published or of convincing a decision-maker or regulator, and assuming methods and results (incl. P-values) are reported fully and accurately and are valid ones, the editors, reviewers and decision-makers are free to accept or reject how I may verbally describe and discuss the results. So how I may “treat them” is secondary and not a matter amenable to any general prescription. So long as most editors and reviewers remain hardcore paleoFisherians we will have problems.

As I’ve mentioned before, and given examples in my papers, for more than three decades neither I nor my students have had problems getting papers published that completely avoided specifying an alpha or critical P value and that completely refrained from use of the term “statistically significant.” It seems that most editors and reviewers are, in effect, softcore paleoFisherians focused primarily on clarity of presentation.

Stuart

Stuart: We should move away from all classifications based on personalities or the conceptions that various cults have built around important statisticians.

The fact that your people published without using a given word isn’t a good argument for banning that word or concept. You know how I feel about mandating technical word bans based on tribal manifestos.

Well, it is a very “good argument” for those persons who’ve told me or stated elsewhere over the years, “you’ll never get published” unless you stick to the conventions.

And, so far as I recollect, it is the only small, specific linguistic change suggested that, in the hands of good editors, would impel authors/researchers to think and write more clearly.

To retain it will drag along a whole lot of other unnecessary terminological and conceptual junk, like the idea that P values should be adjusted to take account of “multiplicities.”

Stuart: You know it’s not a good argument to ban fields or words because some people publish without referring to those fields or words.

As for ignoring multiple testing–aha!–now at last a key issue arises. If Wasserstein et al. 2019 had stipulated that we ought not to be concerned with, let alone adjust for, multiple testing, then we could have more clearly appraised it from the start. Such a stipulation, you know, would be in conflict with principle 4 of ASA I.

But I never said or implied that what “some people publish …..” was an argument for disavowing “statistically significant” for writing up results.

The argument for so disavowing is that there is not, and never has been, any logical argument in favor of verbally dichotomizing the P scale, nor any useful purpose served by doing so.

“Statistically significant” is just intellectual cocaine, a crutch, a bewitching but dangerous woman.

I am NOT going to be tempted to open new emails on this thread today!!!!

Stuart: Verbal dichotomization is commonplace along many continuous scales. When I say to my wife “It’s chilly outside today”, she is not subsequently expecting a full distributional analysis of daily temperatures, but just knows to grab her jacket on the way to work and appreciates the timely and brief summary. “Nathan is walking quickly through Central Park.” “Olivia scored extremely high on her qualifying exam.” Such categorizations are quite meaningful, carry contextual understanding, and allow for some ambiguity and uncertainty.

In the spirit of full disclosure, I’ve had the pleasure of working for thirty years now at SAS, and before data explosions from high-throughput technologies in fields like genomics we could likely stake claim to computing more p-values than any other software package. (I would not be surprised if blame for the presumed reproducibility crisis shifts more towards us.) I’ve personally been involved in R&D of mixed linear models (aka hierarchical linear models), and our book SAS for Mixed Models by conservative estimates has over 10K citations. We use “statistically significant” as a shorthand way of saying that the estimated signal-to-noise ratio of a particular effect is large enough to warrant serious consideration as an active component of the scientific system under study with a statistical model. The associated p-value is valuable as a first-look statistic on a standardized probability scale and can naturally lead to error analysis and severe testing as advocated by Mayo. We sometimes genuflect to the classic 0.05 rule of thumb for illustration but overall emphasize that analysts should always calibrate and interpret p-values within the broader context of the scientific questions under study with the particular sample size and experiment design, and bring to bear all relevant background knowledge and sources of variability before making substantive conclusions.

I’ve read Hurlbert and Lombardi (2009) and your Coup de Grace paper in ASA II with Levin and Utts. You list five points and I’m on board with four of them, but not the language-policing one and the accompanying proposal. My experience has been that researchers crave and value verbal interpretations of all statistics, especially nuanced ones like p-values, and this includes thoughtful dichotomizations and rules of thumb. I’m also acutely aware of the difficulties and dangers of misinterpreting p-values and hypothesis tests in general, and concede that naive and overstated summaries can lead to publication bias. However, I would contend that the adverb “statistically” adequately qualifies “significant” and the combined phrase is not really at the heart of the problem. One of the bull’s sharp horns may have nicked you unwittingly, the relevant chemical here appears to be caffeine and not cocaine, and any witches involved are more like Samantha in Bewitched.

Russ, your point about dichotomisation in natural language is spot on, and if everyone used your definition of significance (Fisher’s, if I’m not mistaken) then we would not be worrying about p-values and statistical significance.

Lew: I think it’s clear that the problems with replication, which are claimed to be at the heart of the recent turmoil regarding statistical significance tests (as declared at the start of ASA I), have almost entirely to do with the latitude in many fields, coupled with powerful data-analytic methods, that opens the door to data dredging, multiple testing, and a slew of selection effects. The industrialization of stat, as Benjamini calls it, is scarcely to be blamed on p-values. The same P-hacked hypothesis can appear in a Bayes factor or other method. But there’s one big difference: your grounds for criticism may have vanished. The replicability crisis is all about the difficulty of finding small p-values when an independent group tries to replicate effects that had been found to produce statistically significant results, only now with predesignated hypotheses and tighter controls. What this shows is that the problem isn’t P-values but the data dredging. It is no surprise that replication goes way, way up when there’s an insistence upon predesignated hyps and curtailing selection effects. When a researcher is guilty of biasing selection effects & a host of QRPs, P-values do just what they should: they make the effect difficult to replicate once the cheating is absent. So rather than scapegoating P-values, it should be recognized that they are the basis for identifying irreplication. You can’t say we distrust P-values while at the same time relying on them to inform us of which fields are in crisis. Trading them in for methods that lack error control would be a disaster.

Stuart:

Quite the opposite. She’s been treated as the most unpopular girl. But in this blogpost, she’s finally invited to dance.

https://errorstatistics.com/2015/08/31/the-paradox-of-replication-and-the-vindication-of-the-p-value-but-she-can-go-deeper-i/

As I think you realize, I was talking about how to “treat” (describe, think about, etc.) the result of a single test depending on whether the resultant P value was .045 or .055. You, I gather, would call the effect “significant” in the one case and “non-significant” in the other. However I referred to the result, I would not make a verbal distinction between the two possibilities.

Stuart: No I would not. I described what I would do. But what you recommend is no different from the classic N-P view (where specific observations aren’t taken into account, but only whether they fall in given regions). If not, you are saying all observations will be treated the same, since the argument can be continued over all values (it’s the fallacy of the heap again.)

Of course in practice, N-P recommended reporting the actual P-value. There’s a confusion between predesignated stipulations of those results the researcher agrees will not count as in favor of his effect, and adopting fixed levels to use in all cases.

This is a comment from the outside. I consider two different ways of interpreting P-values.

Interpretation 1: A P-value is a measure of how well a model approximates the data. This approach is expounded in

On P-Values, Statistica Sinica 28 (2018), 2823-2840

and in more generality in the book

Data Analysis and Approximate Models, Monograph 133 of the Chapman and Hall series Statistics and Applied Probability, (2014).

Once the approximate nature of statistical models is accepted there are no ‘true’ parameter values. Questions of truth apply only to the real world, not to parameter values. A confidence interval loses its interpretation and becomes the set of parameter values, if any, which give acceptable approximations. A small such interval may indicate very few acceptable parameter values rather than high precision. Statements such as ‘this model is a significant approximation’ make as little sense as saying ‘this person is significantly tall’. Instead one states the degree of approximation in terms of P-values and the height in terms of inches. From the point of view of approximation the whole ASA discussion of P-values is irrelevant.

Interpretation 2: This is at present limited to linear regression with some generalizations to robust and non-linear parametric regression. It is joint work with Lutz Duembgen of the University of Bern. Take the stackloss data and run a linear regression.

b <- lm(stackloss[,4] ~ as.matrix(stackloss[,1:3]))

The factor acid concentration has a P-value of 0.344. The sum of squared residuals is 178.83.

Now replace acid concentration by a covariate consisting of 21 i.i.d. N(0,1) random variables

b <- lm(stackloss[,4] ~ as.matrix(stackloss[,1:2]) + rnorm(21))

In one such simulation the sum of squared residuals was 163.29. Repeat this, say, 1000 times. In one such set of 1000 simulations the random Gaussian covariate was better than acid concentration, giving a smaller sum of squared residuals, in 343 of the 1000 runs. The empirical relative frequency of being better was 0.343, which is suspiciously close to the standard P-value of 0.344. In fact this is a theorem: the probability of the random Gaussian covariate being better is exactly the standard P-value derived from the F-distribution. This holds whatever the data, not just for the stackloss data. Thus the result is model free; there are no parameters. We call this the P-value of the covariate.

This result is, even if I say so myself, quite amazing. This P-value is model free, exact and applies to the data at hand. It does not assume that the data were generated under the standard linear model with i.i.d. Gaussian errors. Moreover it has a simple intuitive interpretation. It is the probability that Gaussian covariates are better than the covariate under consideration.
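The simulation is easy to reproduce in miniature in Python. The sketch below is hypothetical: it uses made-up data in place of the stackloss numbers (which are not reproduced here), assumes numpy and scipy, and compares the Monte Carlo frequency with which a fresh N(0,1) covariate beats a given covariate x against the classical partial F-test P-value for x:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up data standing in for stackloss: two covariates (plus an
# intercept) are kept, and we assess a third covariate x.
n = 21
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
x = rng.normal(size=n)
y = Z @ np.array([1.0, 2.0, -1.0]) + 0.3 * x + rng.normal(size=n)

def rss(y, X):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

ss1 = rss(y, np.column_stack([Z, x]))  # with x
ss0 = rss(y, Z)                        # without x
k = Z.shape[1] + 1                     # number of covariates including x

# Classical partial F-test P-value for x
F = (ss0 - ss1) / (ss1 / (n - k))
p_classical = float(stats.f.sf(F, 1, n - k))

# Monte Carlo: how often does a fresh N(0,1) covariate beat x, i.e.
# give a smaller residual sum of squares?
reps = 20000
wins = sum(rss(y, np.column_stack([Z, rng.normal(size=n)])) < ss1
           for _ in range(reps))
p_gauss = wins / reps
print(p_classical, p_gauss)  # the two agree up to Monte Carlo error
```

Nothing in the comparison uses the way y was generated; any fixed y would do.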

The advantages go much further. It can be extended to covariate choice, including the case where the number of covariates q greatly exceeds the sample size n, which is often the case in gene expression data. As an example, take the Boston housing data of size n=506 with k=13 covariates. Now consider all interactions of the covariates of degree at most 8. There are 203490 such interactions. The stepwise selection method selects 6 of the 203490 interactions in just 6 seconds. It outperforms lasso and knockoff in all respects. An R package gausscov is available. A first version of the paper may be found under

http://arxiv.org/abs/1906.01990

Laurie:

On your first point: I’ve never heard anyone say ‘this model is a significant approximation’. (You’re right, it makes no sense.) You say “one states the degree of approximation in terms of P-values”. Do you mean something like: small/large P-values indicate a poor/adequate approximation? Or rather, that one converts the result into an effect size, or claims a discrepancy of d is or is not indicated?

You say: “From the point of view of approximation the whole ASA discussion of P-values is irrelevant.” The Wasserstein et al. 2019 editorial claims that a small P-value never indicates the presence of an effect or discrepancy. They declare further that P-values tell you nothing about scientific importance. So they think they are saying something that conflicts with what you’re saying, whether you leave your claims at a qualitative level or look at indicated effect sizes/discrepancies or the like.

The standard linear test for the importance of a regressor X in the presence of regressors Z can be described as follows. Project X on the orthocomplement space of Z. Call the projection X*. Compute R, the residuals from the model based only on Z. Compare now the size of the projection of R on X* to the value of the projection of R on a random direction orthogonal to Z.

This is what Davies and Duembgen suggest. They do it by simulating Gaussian random variables, but in fact any elliptically symmetric random vector will do.

Note that this is not a real nonparametric bootstrap. The validity of the calculated P-value is based on the normality assumption on the noise of the model. This implies that the above projection of R is omnidirectional.

However, the more important and special property of the linear model and the least squares estimator is that they are both regular. In particular, there is a property of the test that, after all these years, I still find surprising: the distribution depends only on the dimension. That is, it does not matter whether X and Z are orthogonal or almost multicollinear.
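The “depends only on the dimension” property can be illustrated numerically. In this hedged Python sketch (invented data; numpy and scipy assumed), the null distribution of the partial F-test P-value is Uniform(0,1) whether x is orthogonal to Z or nearly collinear with it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 30

def partial_f_pvalue(y, Z, x):
    """Classical partial F-test P-value for covariate x given covariates Z."""
    def rss(M):
        beta, *_ = np.linalg.lstsq(M, y, rcond=None)
        return float(np.sum((y - M @ beta) ** 2))
    ss0 = rss(Z)
    ss1 = rss(np.column_stack([Z, x]))
    k = Z.shape[1] + 1
    F = (ss0 - ss1) / (ss1 / (n - k))
    return float(stats.f.sf(F, 1, n - k))

Z = rng.normal(size=(n, 2))
x_orth = rng.normal(size=n)
x_orth -= Z @ np.linalg.lstsq(Z, x_orth, rcond=None)[0]  # orthogonal to Z
x_coll = Z[:, 0] + 0.01 * rng.normal(size=n)             # nearly collinear with Z

# Under a Gaussian null (y independent of everything), the P-value is
# Uniform(0,1) in both cases: the geometry of x relative to Z is irrelevant.
p_orth = [partial_f_pvalue(rng.normal(size=n), Z, x_orth) for _ in range(2000)]
p_coll = [partial_f_pvalue(rng.normal(size=n), Z, x_coll) for _ in range(2000)]
print(np.mean(p_orth), np.mean(p_coll))  # both near 0.5
```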

However, when we get to the ultra-high dimension, we lose this regularity. The lasso, in particular, shrinks strongly towards 0. As a result, replacing X (which may be highly correlated with Z) with an independent Gaussian variable may dramatically change the active set, in particular its dimension. The bias that is created because we cannot orthogonalize with respect to Z (neither Y nor X) disappears with the simulated variable (it is orthogonal by construction). So it isn’t simple anymore.

JR: Thanks for your comment; most readers are likely not to get the gist of it. Are you agreeing w/ Laurie that P-values accomplish something important in this case?

This blog thread treats a mixture of topics. Regarding ASA I, ASA II, and the ASA itself, which seems to have been the original goalpost, below is some forensic analysis of the background of such ASA statements.

An early 2014 ASA statement was about the value of Value Added Models used in teacher evaluations.

https://www.tandfonline.com/doi/full/10.1080/2330443X.2014.956906

This VAM statement was an implementation of “an ASA advocacy policy of issuing public statements about topics related to using data and statistical tools. These statements are member-driven and Board-approved. That is, members identify an issue to address and provide the expertise and energy to develop a well-reasoned statement. The Board reviews the statement, modifies it (or asks that it be modified) as necessary, and then approves and publicizes it.”

It is useful to look at the VAM statement as a precedent for the discussions of ASA I and ASA II, both in terms of content and impact.

Some discussions of the ASA VAM statement are:

http://www.shankerinstitute.org/blog/quick-look-asa-statement-value-added

https://kappanonline.org/value-added-models-what-the-experts-say/

Moreover, in Section 6.3 of my book on information quality with Galit Shmueli, we discuss the information quality assessment of the ASA VAM statement. https://www.wiley.com/en-us/Information+Quality%3A+The+Potential+of+Data+and+Analytics+to+Generate+Knowledge-p-9781118890653

Specifically, we state: “To summarize, the ASA VAM statement is comprehensive in terms of statistical models and their related assumptions. We summarize the ratings for each dimension in Table 6.3. The InfoQ score for the VAM statement is 57%. The caveats and the implication of such assumptions to the operationalization of VAM are the main points of the statement. The data resolution, data structure, data integration, and temporal relevance in VAM are very high. The difficulties lie in the chronology of data and goal, operationalization, generalization, and communication of the VAM outcomes. The ASA statement is designed to reflect this problematic use of VAM. It is however partially ambiguous on these dimensions leaving lots of room for interpretation. Examining and stating these issues through the InfoQ dimensions helps create a clearer and more systematic picture of the VAM approach”.

So, in retrospect, the VAM statement provided a discussion platform highlighting cautionary points, but was not a directive such as the uprising against statistical significance in the Amrhein et al. paper.

The information quality assessment summarised above indicated the poor operationalisation of the VAM statement and the ambiguity it carries through.

I presented this, among other things, in a talk at the 2014 JSM in Boston and had a follow-up discussion on this with Sharon Lohr. I do not know if the ASA did a retrospective evaluation of the impact of the VAM statement; they should have…

It seems that ASA I had similar information quality to the VAM statement, and that the ASA II collection of papers led to (misinterpreted) directives.

So, we moved from misinterpretation of p-values to misinterpretation of recommendations derived from a position of “les stats, c’est moi”.

As I mentioned before on Mayo’s blog, “Les Stat” involves more disciplines than statistics, and a discussion of it should be inclusive. See my testimonial on the SSI conference in Bethesda that was the basis of ASA II: https://blogisbis.wordpress.com/2017/10/24/to-p-or-not-to-p-my-thoughts-on-the-asa-symposium-on-statistical-inference/

rkenett:

It is interesting to compare the ASA P-value statement with their statement on VAM models–something I know nothing about. The ASA declared that ASA I was the first time it had issued a statement on matters of statistical methodology (as opposed to, I guess, matters of policy). It is discussed at the start of the 2016 ASA I. Of course, we now know that the March 2019 Wasserstein editorial was not an ASA policy doc, or a continuation of one. So for 9 months, many have been confused about this. It’s good that Kafadar’s efforts appear to have cleared this up for now.


Deborah

As an example take the Kolmogorov distance of an empirical distribution function from that of a model, where for me a model is a single probability distribution: N(0,1) is a model, N(10,50) is another model. You can measure how good the approximation is by stating the distance or you can give the P-value. Here as elsewhere a small P-value means a poor approximation. The distance may be perfectly acceptable for your purpose although the P-value is very small. Alternatively the P-value may be large, a good approximation, but nevertheless the fit is too poor for your purpose. See page 96 of Huber’s book ‘Data Analysis’; the relevant chapter is entitled ‘Approximate Models’.

Take my example of copper in drinking water (page 8 of the book) and take the Gaussian family of models. The first problem is to decide whether there is an acceptable approximation in this family. You have to decide which features are of relevance, and this will depend on your knowledge of the data. Suppose I choose the following four features: the mean, the standard deviation, goodness of fit (Kolmogorov) and some measure for outliers. For this example see the Statistica Sinica paper. This gives you an approximation region, all (mu,sigma^2) values for which the N(mu,sigma^2) model is an adequate approximation to the water data. Here the degree of approximation is measured by the four P-values, but as the first example above shows this is not always the case.

Suppose the question of interest is whether the amount of copper in the water sample exceeds the legal limit, which I put at 3mg/litre for the sake of argument. I identify speculatively and provisionally the amount of copper in the water with the mu of the N(mu,sigma^2). The largest value of mu for which there exists a sigma such that N(mu,sigma^2) is an adequate approximation is 2.10 (see the Statistica Sinica paper). Alternatively: there is no adequate approximation with mu >= 3. Now we go back and question our identification. What can go wrong? What has gone wrong? This can only be done in collaboration with the owner of the data. How well can the measuring instrument be calibrated? How dependent are the measurements on the person who actually does them? How are the data transcribed? Is there any possibility of collaboration with other laboratories who have been sent the same water sample? Etc. If all is well we conclude, still speculatively and provisionally but less so than before, that the amount of copper does not exceed the legal limit.
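A toy version of the procedure can be sketched in Python, using only the Kolmogorov goodness-of-fit feature (Davies uses four) and synthetic data in place of the copper measurements, which are in the book and not reproduced here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic stand-in for the copper measurements: 27 values near 2 mg/litre.
data = rng.normal(loc=2.0, scale=0.2, size=27)

def adequate(mu, sigma, x, threshold=0.05):
    """Is N(mu, sigma^2) an adequate approximation of x, judged (here)
    by the Kolmogorov P-value alone?"""
    return stats.kstest(x, "norm", args=(mu, sigma)).pvalue >= threshold

# Crude approximation region: scan a (mu, sigma) grid and keep those mu
# for which some sigma yields an adequate approximation.
mus = np.linspace(1.0, 3.5, 126)
sigmas = np.linspace(0.05, 0.8, 39)
ok_mu = [m for m in mus if any(adequate(m, s, data) for s in sigmas)]

mu_max = max(ok_mu)  # largest mu still adequately approximating the data
print(mu_max)
```

If the legal limit were 3 mg/litre, the question is simply whether mu_max reaches it.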

My concept of approximation is based on P-values. If P-values are banned it means that I now have no possibility of judging whether a model is adequate for the data. I could accept this if a better concept of approximation were developed which does not depend on P-values. I see little in the literature of any attempt to develop a concept of approximate stochastic models, although there is some work in this direction. For the Bayesians and other proponents of likelihood I state that there can be no concept of approximation based on likelihood. Wasserstein et al have no concept of approximation. Would they accept the N(10,100) model for the copper data? Of course they wouldn’t. But then we go through small steps to, say, the N(2.10,0.18^2) model. When do they start accepting, and how do they do it? I think we should be told.

jr

You have completely misunderstood what is going on. This happens most of the time I try to explain it. The reason, I think, is that the way of thinking is completely foreign to the way most people think about statistics. There was an attempt to force it into a Bayesian framework on Andrew Gelman’s blog. It failed impressively. Now you try and force it into the standard Gaussian error model or an elliptically symmetric random vector. I will try again.

Firstly there is no bootstrapping; no simulations are necessary. You can calculate exactly the probability that the Gaussian covariate is better than the actual covariate X. With just one covariate X and k-1 covariates already included, it is

\text{pbeta}(ss_1/ss_0,(n-k)/2,1/2)

where ss_1 is the sum of squared residuals including X and ss_0 the sum without X.

This does not depend on the dependent variable y or on the covariates included, as long as they are linearly independent. It does NOT depend on the normality of the errors or on the errors being elliptically symmetric. The only reason I simulated in the first post was a pedagogical one. In my experience people do not understand me when I say ‘the probability that the Gaussian covariate is better is …’. I think it is because the method is (i) too simple (it’s so simple it cannot be true) and (ii) too much at odds with the way they think about statistics.
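The identity between this Beta expression and the classical F-test P-value can be checked to machine precision. Here is a Python sketch with arbitrary made-up data, where pbeta corresponds to scipy’s stats.beta.cdf:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 30, 4                     # k covariates in total, X being the k-th
W = rng.normal(size=(n, k - 1))  # the k-1 covariates already included
x = rng.normal(size=n)
y = rng.normal(size=n)           # arbitrary data; no model is assumed

def rss(y, X):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

ss0 = rss(y, W)                        # without X
ss1 = rss(y, np.column_stack([W, x]))  # with X

# pbeta(ss_1/ss_0, (n-k)/2, 1/2) ...
p_beta = float(stats.beta.cdf(ss1 / ss0, (n - k) / 2, 0.5))

# ... equals the classical partial F-test P-value for X
F = (ss0 - ss1) / (ss1 / (n - k))
p_F = float(stats.f.sf(F, 1, n - k))
print(p_beta, p_F)  # identical up to floating point
```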

The one-step method compares the best of the remaining covariates with the best of the same number of i.i.d. Gaussian covariates. Suppose we have q covariates in all and k of these have already been included. If x_j is the best of the remaining q-k covariates, the probability that the best of the Gaussian covariates is better is

1-\text{pbeta}(1-ss_j/ss_0,1/2,(n-k-1)/2)^{q-k}

where ss_j is the sum of squared residuals including x_j and ss_0 the sum without x_j. Again this probability is exact and does not depend on the data. It is model free.
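The one-step formula can be checked by Monte Carlo. By the symmetry of the Beta distribution, pbeta(1-x, 1/2, a) = 1 - pbeta(x, a, 1/2), so the expression equals 1-(1-pbeta(ss_j/ss_0, (n-k-1)/2, 1/2))^{q-k}, which is the form used in this Python sketch with invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k, q = 25, 3, 10            # k covariates included, q - k candidates left
W = rng.normal(size=(n, k))
y = rng.normal(size=n)         # arbitrary data

# Residuals of y on W, via the projection onto the orthocomplement of W
P = np.eye(n) - W @ np.linalg.solve(W.T @ W, W.T)
r = P @ y
ss0 = float(r @ r)
ssj = 0.8 * ss0                # threshold playing the role of ss_j

# One Gaussian covariate beats ss_j with this probability ...
p_one = float(stats.beta.cdf(ssj / ss0, (n - k - 1) / 2, 0.5))
# ... so the best of q - k independent Gaussian covariates beats it with
p_formula = 1 - (1 - p_one) ** (q - k)

# Monte Carlo check: adding covariate g to W drops the residual sum of
# squares by (r'g~)^2 / ||g~||^2, where g~ is g projected off W.
reps = 20000
wins = 0
for _ in range(reps):
    G = P @ rng.normal(size=(n, q - k))
    drop = (r @ G) ** 2 / np.einsum("ij,ij->j", G, G)
    wins += (ss0 - drop.max()) < ssj
p_mc = wins / reps
print(p_formula, p_mc)  # agree up to Monte Carlo error
```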

So it is model free and it is simple, very simple. It is also very fast, it does not overfit, and it is in all respects better than lasso and knockoff.

Dear Laurie,

I do believe I understood you.

The bootstrap, by definition, doesn’t need a computer. It is just about approximating the distribution of a statistic under the true distribution by its distribution under another distribution (e.g., the empirical, the parametric MLE, the wild). Of course, normality is needed both for the classical F test and hence for your evaluation of it. Yes, for both, normality holds approximately, asymptotically.

Do you have proof that your method works in the non-regular case of the lasso? The standard de-biased lasso needs something like the compatibility condition. How does it enter into your argument? There are necessary conditions of sparsity both on the linear model and on the node-wise lasso, how do they apply to your argument?

Best.

jr

I still don’t understand you. Take the simplest case of a dependent variable y=(y_1,…,y_n) and one covariate x=(x_1,…,x_n). These could be anything, with only the restriction that y \ne 0, x \ne 0 and x \ne y. There is no model, no truth, no true distribution. Put ss_0=y’y. Regress y on x to give a sum of squared residuals ss_1=y’y-(y’x)^2/x’x. Now replace x by the Gaussian covariate Z=(Z_1,…,Z_n) where the Z_i are i.i.d. N(0,1). The sum of squared residuals is SS_1=y’y-(y’Z)^2/Z’Z. Now Z is better than x if SS_1<ss_1, or equivalently SS_1/ss_0<ss_1/ss_0. Now 1-SS_1/ss_0=((Z’y)/sqrt(y’y))^2/Z’Z and (Z’y)/sqrt(y’y) is N(0,1) whatever y \ne 0. This implies that SS_1/ss_0 is approximately 1-chisq(1)/n. This approximation was used in early versions of the paper, but later Lutz Duembgen proved that the exact distribution of SS_1/ss_0 is Beta((n-1)/2,1/2). There is no model, there is no random error term, there are no asymptotics, no comparison of the (non-existent) ‘true distribution’ with some other distribution. The result holds for all y and x. It is in every sense model free. Lutz Duembgen extended the result by showing that you can replace Z by a random orthogonal rotation of x. This is the first and main part of the paper.
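Duembgen’s exact Beta law is easy to check by simulation. In the Python sketch below (numpy/scipy assumed), y is deliberately non-Gaussian to emphasize that nothing is assumed about it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 15
y = rng.exponential(size=n)   # arbitrary fixed data; nothing Gaussian about y

reps = 5000
Z = rng.normal(size=(reps, n))            # rows are i.i.d. Gaussian covariates
ratio = 1 - (Z @ y) ** 2 / ((Z * Z).sum(axis=1) * (y @ y))  # SS_1/ss_0

# Exact law: SS_1/ss_0 ~ Beta((n-1)/2, 1/2), whatever y is.
print(ratio.mean(), (n - 1) / n)          # sample mean vs Beta mean
ks = stats.kstest(ratio, "beta", args=((n - 1) / 2, 0.5))
print(ks.pvalue)
```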

There is a second part to the paper which does involve the standard linear model. It is shown that the F distribution holds not only for i.i.d. normal errors but for any errors whose distribution is invariant under random orthogonal rotations. This again is due to Lutz Duembgen. Note that the model free part of the paper is completely independent of this result.

We cannot carry out a private discussion comparing the Gaussian covariate method with lasso here, but if you wish to discuss the two methods, contact me at my email address

laurie.davies@uni-due.de

All the best for 2020.