P-values

My Responses (at the P-value debate)


How did I respond to those 7 burning questions at last week’s (“P-Value”) Statistics Debate? Here’s a fairly close transcript of my (a) general answer, and (b) final remark, for each question–without the in-between responses to Jim and David. The exception is question 5 on Bayes factors, which naturally included Jim in my general answer. 

The questions with the most important consequences, I think, are questions 3 and 5. I’ll explain why I say this in the comments. Please share your thoughts.

Question 1. Given the issues surrounding the misuses and abuse of p-values, do you think they should continue to be used or not? Why or why not?

Yes, we should continue to use P-values and statistical significance tests. P-values are one piece in a rich set of tools for assessing and controlling the probabilities of misleading interpretations of data (error probabilities). They’re “the first line of defense against being fooled by randomness” (Yoav Benjamini). If even larger or more extreme effects than the one you observed are frequently brought about by chance variability alone (the P-value is not small), clearly you don’t have evidence of incompatibility with the “mere chance” hypothesis.
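To make that reasoning concrete, here is a minimal simulation sketch (the numbers, the two-group Normal setup, and the use of NumPy/SciPy are illustrative assumptions, not part of the debate): the P-value asks how often chance variability alone would produce a difference at least as large as the one observed.

```python
# Sketch with hypothetical numbers: how often would chance variability alone
# produce a difference between two group means at least as large as the one observed?
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
observed_diff = 0.6        # hypothetical observed difference in group means
n, sigma = 50, 1.0         # hypothetical per-group sample size and known SD

# Under the "mere chance" hypothesis, both groups share the same mean.
sims = rng.normal(0.0, sigma, size=(100_000, 2, n))
chance_diffs = sims[:, 0, :].mean(axis=1) - sims[:, 1, :].mean(axis=1)
p_sim = np.mean(np.abs(chance_diffs) >= observed_diff)

# Analytic counterpart: two-sided z test on the difference in means.
se = sigma * np.sqrt(2.0 / n)
p_z = 2 * stats.norm.sf(observed_diff / se)
print(p_sim, p_z)   # both ~0.003: such a difference is rarely produced by chance alone
```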

Even those who criticize P-values will employ them at least if they care to check the assumptions of their statistical models—this includes Bayesians George Box, Andrew Gelman, and Jim Berger.       

Critics of P-values often allege it’s too easy to obtain small P-values, but notice that the replication crisis is all about how difficult it is to get small P-values with preregistered hypotheses. This shows the problem isn’t P-values but the selection effects and data dredging. However, the same data-dredged hypotheses can occur in likelihood ratios, Bayes factors, and Bayesian updating, except that we then lose the direct grounds to criticize inferences for flouting error statistical control. The introduction of prior probabilities, which may also be data dependent, offers further researcher flexibility.

Those who reject P-values are saying we should reject a method because it can be used badly. That’s a very bad argument, one that commits a straw person fallacy.

We should reject misuses and abuses of P-values, but there’s a danger of blithely substituting “alternative tools” that throw out the error control baby with the bad statistics bathwater.

Final remark on P-values

What’s missed in the reject-P-values movement is that a major reason for calling in statistics in science is that it gives tools to inquire whether an observed phenomenon could be a real effect or just noise in the data. P-values have the intrinsic properties for this task, if used properly. To reject them is to jeopardize this important role of statistics. As Fisher emphasized, we seek randomized controlled trials precisely to ensure the validity of statistical significance tests. To reject P-values because they don’t give posterior probabilities of hypotheses is illicit. The onus is on those who claim we want such posteriors to show, for any particular way of getting them, why.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Question 2. Should practitioners avoid the use of thresholds (e.g., P-value thresholds) in interpreting data? If so, does this preclude testing?

There’s a lot of confusion about thresholds. What people oppose are dichotomous accept/reject routines. We should move away from them, as well as from unthinking uses of thresholds like 95% confidence levels or other quantities. Attained P-values should be reported (as all the founders of tests recommended). We should not confuse fixing a single threshold to use habitually with prespecifying a threshold beyond which there is evidence of inconsistency with a test hypothesis (I’ll often call it the null for short).

Some think that banishing thresholds would diminish P-hacking and data dredging. It is the opposite. In a world without thresholds, it would be harder to criticize those who fail to meet a small P-value because they engaged in data dredging and multiple testing, and at most have given us a nominally small P-value. Yet that is the upshot of declaring that predesignated P-value thresholds should not be used at all in interpreting data. If an account cannot say in advance about any outcomes that they will not count as evidence for a claim, then there is no test of that claim.

Giving up on tests means forgoing statistical falsification. What’s the point of insisting on replications if at no point can you say the effect has failed to replicate?

You may favor a philosophy of statistics that rejects statistical falsification, but it will not do to declare by fiat that science should reject the falsification or testing view. (The “no thresholds” view also torpedoes common testing uses of confidence intervals and Bayes Factor standards.)

So my answer is NO and YES: don’t abandon thresholds; to do so is to ban tests.

Final remark on thresholds Q-2

A common fallacy is to suppose that because we have a continuum, we cannot distinguish points at the extremes (the fallacy of the beard). We can distinguish results readily produced by random variability from cases where there is evidence of incompatibility with the chance variability hypothesis. We use thresholds throughout science, for example to determine whether someone is pre-diabetic, diabetic, and so on.

When P-values are banned altogether … the eager researcher does not stop at “I’m simply describing”; they invariably go on to claim evidence for a substantive psychological theory, on results that would have been blocked had they been required to meet a reasonably small P-value threshold.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Question 3. Is there a role for sharp null hypotheses or should we be thinking about interval nulls?

I’d agree with those who regard testing of a point null hypothesis as problematic and often misused. Notice that arguments purporting to show P-values exaggerate evidence are based on this point null together with a spiked or lump prior on it. By giving a spike prior to the nil, it’s easy to find the nil more likely than the alternative. This is the Jeffreys-Lindley paradox: the P-value can differ sharply from the posterior probability on the null. But the posterior can also equal the P-value; it can range anywhere from p to 1 – p. In other words, Bayesians differ amongst themselves, because with diffuse priors the P-value can equal the posterior on the null hypothesis.
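The spiked-prior point can be illustrated with a small numerical sketch. The choices below (a ½ spike on the point null, a Normal prior with scale equal to sigma under the alternative, and a just-significant z of 1.96) are the standard textbook setup, used here purely for illustration rather than as any discussant’s recommendation.

```python
# Sketch: posterior probability of a point null under a spike-and-slab prior,
# for a result that is just significant at the two-sided 0.05 level.
import numpy as np
from scipy import stats

z = 1.96
p_value = 2 * stats.norm.sf(z)          # ~0.05 regardless of n

for n in (10, 100, 1_000, 10_000):
    # Illustrative assumptions: P(H0) = 1/2 on theta = 0; under H1,
    # theta ~ N(0, tau^2) with tau = sigma, so n * tau^2 / sigma^2 = n.
    ratio = n
    bf01 = np.sqrt(1 + ratio) * np.exp(-0.5 * z**2 * ratio / (1 + ratio))
    post_h0 = bf01 / (1 + bf01)          # posterior on the point null
    print(f"n={n}: p={p_value:.3f}, P(H0|x)={post_h0:.2f}")

# The posterior on H0 climbs from ~0.37 at n=10 to ~0.94 at n=10,000: the same
# p = 0.05 result comes to look like evidence *for* the spiked null as n grows.
```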

My own work reformulates the results of statistical significance tests in terms of discrepancies from the null that are well or poorly tested. A small P-value indicates a discrepancy from a null value because, with high probability (1 – p), the test would have produced a larger P-value (a less impressive difference) in a world adequately described by H0. Since the null hypothesis would very probably have survived were it correct, its failure to survive indicates inconsistency with it.
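That probability claim can be checked directly for a one-sided Normal (z) test; a minimal sketch, with an illustrative observed z of 2.5 and SciPy assumed:

```python
# Sketch: one-sided Normal test of H0: mu <= 0 with test statistic z.
from scipy import stats

z_obs = 2.5                                    # illustrative observed statistic
p = stats.norm.sf(z_obs)                       # attained P-value, ~0.006
prob_less_impressive = stats.norm.cdf(z_obs)   # P(Z < z_obs; H0) = 1 - p, ~0.994
print(p, prob_less_impressive)

# In a world adequately described by H0, a smaller (less impressive) difference
# than the one observed would occur with probability 1 - p = 0.994; so the
# observed result indicates incompatibility with H0.
```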

Final remark on sharp nulls Q-3

The move to redefine significance, advanced by a megateam including Jim, rests upon the lump of high prior probability on the null as well as on evaluating P-values using Bayes factors. It’s not equipoise; it’s biased in favor of the null. The redefiners are prepared to say there’s no evidence against, or even evidence for, a null hypothesis, even though that point null is entirely excluded from the corresponding 95% confidence interval. This would often erroneously fail to uncover discrepancies.

Whether to use a lower threshold is one thing; to argue that we should on the basis of Bayes factor standards lacks legitimate grounds.[1][2]

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Question 4. Should we be teaching hypothesis testing anymore, or should we be focusing on point estimation and interval estimation?

Absolutely. The way to understand confidence interval estimation, and to fix its shortcomings, is to understand its duality with tests. The same person who developed confidence intervals in the 1930s, Jerzy Neyman, also developed tests. The intervals are inversions of tests.

A 95% CI contains those parameter values that are not statistically significantly different from the data at the 5% level.

While I agree that P-values should be accompanied by CIs, my own preferred reconstruction of tests blends intervals and tests. It reports the discrepancies from a reference value that are well or poorly indicated at different levels—not just one level like 0.95. This improves on current confidence interval use. For example, the justification standardly given for inferring a particular confidence interval estimate is that it came from a method which, with high probability, would cover the true parameter value. This is a performance justification. The testing perspective on CIs gives an inferential justification. I would justify inferring evidence that the parameter exceeds the CI lower bound this way: if the parameter were smaller than the lower bound, then with high probability we would have observed a smaller value of the test statistic than we did.
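Here is a brief sketch of the duality and of this inferential reading of the lower bound, for the simple case of a Normal mean with known standard deviation (the data values are hypothetical and SciPy is assumed):

```python
# Sketch: Normal mean, known sigma; the 95% CI collects exactly those mu0
# that a two-sided test would not reject at the 5% level.
import numpy as np
from scipy import stats

x_bar, sigma, n = 1.2, 1.0, 25            # hypothetical data
se = sigma / np.sqrt(n)
lower, upper = x_bar - 1.96 * se, x_bar + 1.96 * se

def p_two_sided(mu0):
    """P-value of the test of H0: mu = mu0, given x_bar."""
    return 2 * stats.norm.sf(abs(x_bar - mu0) / se)

print(round(lower, 3), round(upper, 3))           # 0.808, 1.592
print(round(p_two_sided(lower), 3))               # ~0.05: the boundary value is just non-rejected
print(round(p_two_sided(x_bar - 3 * se), 4))      # a value below the interval is rejected

# Inferential reading of the lower bound: were mu as small as `lower`, an
# estimate as large as x_bar (or larger) would occur with probability only 0.025.
print(stats.norm.sf((x_bar - lower) / se))        # 0.025
```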

Amazingly, the immediate past president of the ASA, Karen Kafadar, had to appoint a new task force on statistical significance tests to affirm that statistical hypothesis testing is indeed part of good statistical practice. Much credit goes to her, though, for bringing this about.

Final remark on question 4

Understanding the duality between tests and CIs is the key to improving both. …So it makes no sense for advocates of the “new statistics” to shun tests. The testing interpretation of confidence intervals also scotches criticisms based on examples where a 95% confidence estimate contains all possible parameter values. Although such an inference is ‘trivially true,’ it is scarcely vacuous in the testing construal. As David Cox remarks, that all parameter values are consistent with the data is an informative statement about the limitations of the data (to detect discrepancies at the particular level).

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Question 5. What are your reasons for or against the use of Bayes Factors?

Jim is a leading advocate of Bayes factors, and also of the non-subjective interpretation of the Bayesian prior probabilities to be used (2006). ‘Eliciting’ subjective priors, Jim has convincingly argued, is too difficult; experts’ prior beliefs, he says, almost never even overlap; and scientists are reluctant to let subjective beliefs overshadow data. Default priors (reference or non-subjective priors) are supposed to prevent prior beliefs from influencing the posteriors; they are data dominant in some sense. But there is a variety of incompatible ways to go about this job.

(A few are maximum entropy, invariance, maximizing the missing information, coverage matching.) As David Cox points out, it’s unclear how we should interpret these default probabilities. Default priors, we are told, are simply formal devices to obtain default posteriors. “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, 299), being improper.

Prior probabilities are supposed to let us bring in background information, but this pulls in the opposite direction from the goal of the default prior, which is to reflect just the data. The goal of representing your beliefs is very different from the goal of finding a prior that allows the data to be dominant. Yet current uses of Bayesian methods combine both in the same computation—how do you interpret them? I think this needs to be assessed now that they’re being so widely advocated.

Final remark on Q-5  

BFs give a comparative appraisal, not a test. The result depends on how you assign the priors to the test and alternative hypotheses.

Bayesian testing, Bayesians admit, is a work in progress. We shouldn’t kill a well worked-out theory of testing for one that is admitted to be a work in progress.

It might be noted that even default Bayesian Jose Bernardo holds that the difference between the P-value and the BF (the Jeffreys Lindley paradox or Fisher-Jeffreys disagreement) is actually an indictment of the BF because it finds evidence in favor of a null hypothesis even when an alternative is much more likely.

Other Bayesians dislike the default priors because they can lead to improper posteriors and thus to violations of probability theory. This leads some like Dennis Lindley back to subjective Bayesianism.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Question 6. With so much examination of if/why the usual nominal type I error .05 is appropriate, should there be similar questions about the usual nominal type II error?

No, there should not be a similar examination of type II error bounds. Rigid bounds for either error should be avoided. N-P themselves urged that the specifications be used with discretion and understanding.

It occurs to me that, if an examination is wanted, it should be done by the new ASA Task Force on Significance Tests and Replicability. Its members aren’t out to argue for rejecting significance tests but to show they are part of proper statistical practice.

Power, the complement of the type II error probability, is, I often say, one of the most abused notions (it is only defined in terms of a threshold). Critics of statistical significance tests, I’m afraid to say, often fallaciously take a just statistically significant difference at level α as a better indication of a discrepancy from a null if the test’s power to detect that discrepancy is high rather than low. This is like saying it’s a better indication of a discrepancy of at least 10 than of at least 1 (whatever the parameter is). I call it the Mountains out of Molehills fallacy. It results from trying to use power and alpha as ingredients for a Bayes factor, and from viewing non-Bayesian methods through a Bayesian lens.

We set a high power to detect population effects of interest, but finding statistical significance doesn’t warrant saying we have evidence for those effects.

(The significance tester doesn’t infer point values but inequalities: discrepancies of at least such and such.)
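A small sketch of the fallacy for one-sided Normal tests (illustrative numbers; “indication” is rendered here simply as the probability of so large a mean under the stated discrepancy, not as any official measure):

```python
# Sketch: two one-sided z tests of H0: mu <= 0 (alpha = 0.05, sigma = 1),
# differing only in sample size, each yielding a *just*-significant mean.
import numpy as np
from scipy import stats

alpha, sigma, discrepancy = 0.05, 1.0, 0.2    # 0.2: a discrepancy of interest
z_a = stats.norm.isf(alpha)

for n in (25, 2500):
    se = sigma / np.sqrt(n)
    x_cut = z_a * se                          # the just-significant observed mean
    power = stats.norm.sf(z_a - discrepancy / se)
    # How probable is a mean at least this large if mu were only 0.2?
    p_if_02 = stats.norm.sf((x_cut - discrepancy) / se)
    print(f"n={n}: power(0.2)={power:.2f}, observed mean={x_cut:.3f}, "
          f"P(mean >= observed; mu=0.2)={p_if_02:.2f}")

# For a result right at the cutoff, this probability *is* the power, which is the
# point: with n=2500 (power ~1) the just-significant mean is only 0.033, and a
# mean that large is practically guaranteed even if mu were 0.2. The high-power
# test's just-significant result indicates a *smaller* discrepancy, not a larger one.
```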

Final remark on Q-6, power

A legitimate criticism of P-values is that they don’t give population effect sizes. Neyman developed power analysis for this purpose, in addition to using power for comparing tests pre-data. Yet critics of tests typically keep to Fisherian tests that don’t have explicit alternatives or power. Neyman was keen to avoid misinterpreting non-significant results as evidence for a null hypothesis. He used power analysis post-data (as Jacob Cohen did much later) to set an upper bound on the discrepancy from the null value.

If a test has high power to detect a population discrepancy but does not detect it, that is evidence the discrepancy is absent (qualified by the level).

My preference is to use the attained power, but it’s the same reasoning.

I see people objecting to post-hoc power as “sinister,” but they’re referring to computing power using the observed effect as the parameter value in the calculation. That is not power analysis.
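The contrast between the two computations can be put in a few lines for a one-sided Normal test (hypothetical numbers, SciPy assumed): the first is power at a prespecified discrepancy of interest, used to interpret a non-significant result; the second is the “observed power” move being objected to.

```python
# Sketch: a non-significant one-sided z test of H0: mu <= 0
# (alpha = 0.05, sigma = 1, n = 100), observed mean 0.1 (z = 1.0).
import numpy as np
from scipy import stats

sigma, n, alpha = 1.0, 100, 0.05
se = sigma / np.sqrt(n)
z_a = stats.norm.isf(alpha)
x_bar = 0.1

# (a) Power analysis at a prespecified discrepancy of interest, mu = 0.3:
power_at_03 = stats.norm.sf(z_a - 0.3 / se)       # ~0.91
# High power to detect 0.3, yet no rejection: grounds to infer mu < 0.3.

# (b) "Observed power": plugging the observed mean in as the true parameter value.
observed_power = stats.norm.sf(z_a - x_bar / se)  # ~0.26
# This is merely a monotone transform of the attained P-value, not power analysis.

print(round(power_at_03, 2), round(observed_power, 2))
```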

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Question 7. What are the problems that lead to the reproducibility crisis and what are the most important things we should do to address it?

Irreplication is due to many factors, from data generation and modeling to problems of measurement and linking statistics to substantive science. Here I just focus on P-values. The key problem is that in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive-looking findings even when they are spurious. The fact that it becomes difficult to replicate effects when features of the tests are tied down shows the problem isn’t P-values but exploiting researcher flexibility and multiple testing. The same flexibility can occur when the p-hacked hypotheses enter methods being promoted as alternatives to significance tests: likelihood ratios, Bayes factors, or Bayesian updating. But the direct grounds to criticize inferences as flouting error statistical control are lost (at least without adding non-standard stipulations). Since these methods condition on the actual outcome, they don’t consider outcomes other than the one observed. This is embodied in something called the likelihood principle.
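A quick simulation sketch of how dredging inflates the chance of a nominally small P-value (a hypothetical setup with 20 pure-noise outcome variables, NumPy assumed):

```python
# Sketch: hunting across 20 pure-noise outcome variables and reporting
# whichever comparison happens to reach nominal significance.
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_hyps, n_obs = 10_000, 20, 30

hits = 0
for _ in range(n_trials):
    data = rng.normal(0.0, 1.0, size=(n_hyps, n_obs))     # no real effects anywhere
    z = data.mean(axis=1) * np.sqrt(n_obs)                 # z statistic per outcome
    hits += bool((np.abs(z) > 1.96).any())                 # any nominal p < 0.05?

print(hits / n_trials, 1 - 0.95**20)    # ~0.64 in both cases
# Searching 20 null effects turns up at least one nominally small P-value about
# 64% of the time, so the reported "p < 0.05" no longer means what it says.
```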

Admittedly, error control, some think, is only of concern to ensure low error rates in some long run. I argue instead that what bothers us about the P-hacker and data dredger is that they have done a poor job in the case at hand. Their method very probably would have found some such effect even if it were merely noise.

Probability here is used to assess how well tested claims are, which is very different from how comparatively believable they are—claims can even be true though poorly tested. Though there’s room for both types of assessment in different contexts, how plausible and how well tested are very different, and this needs to be recognized.

To address replication problems, statistical reforms should be developed together with a philosophy of statistics that properly underwrites them.[3]

Final remark on Q-7

Please see the video here or in this news article.

[1] The following are footnotes 4 and 5 from page 252 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. The relevant section is 4.4 (pp. 246-259).

Casella and Roger (not Jim) Berger (1987b) argue, “We would be surprised if most researchers would place even a 10% prior probability of H0. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values and P(H0|x) … are due to a large extent to the large value of [the prior of .5 to H0] that was used.” The most common uses of a point null, asserting the difference between means is 0 or that a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. “J. Berger and Delampady admit…, P-values are reasonable measures of evidence when there is no a priori concentration of belief about H0” (ibid., p. 345). Thus, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).

Harold Jeffreys developed the spiked priors for a very special case: to give high posterior probabilities to well corroborated theories. This is quite different from the typical use of statistical significance tests to detect indications of an observed effect that is not readily due to noise. (Of course isolated small P-values do not suffice to infer a genuine experimental phenomenon.)

In defending spiked priors, J. Berger and Sellke move away from the importance of effect size. “Precise hypotheses . . . ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test” (1987, p. 136).

[2] As Cox and Hinkley explain, most tests of interest are best considered as running two one-sided tests, insofar as we are interested in the direction of departure. (Cox and Hinkley 1974; Cox 2020).

[3] In the error statistical view, the interest is not in measuring how strong your degree of belief in H is but how well you can show why it ought to be believed or not. How well can you put to rest skeptical challenges? What have you done to put to rest my skepticism of your lump prior on “no effect”?

 

 

Categories: bayes factors, P-values, Statistics, statistics debate NISS | Leave a comment

The P-Values Debate


Continue reading

Categories: J. Berger, P-values, statistics debate | 8 Comments

The Statistics Debate! (NISS DEBATE, October 15, Noon – 2 pm ET)

October 15, Noon – 2 pm ET (Website)

Where do YOU stand?

Given the issues surrounding the misuses and abuse of p-values, do you think p-values should be used? Continue reading

Categories: Announcement, J. Berger, P-values, Philosophy of Statistics, reproducibility, statistical significance tests, Statistics | Tags: | 6 Comments

August 6: JSM 2020 Panel on P-values & “Statistical Significance”

SLIDES FROM MY PRESENTATION

July 30 PRACTICE VIDEO for JSM talk (All materials for Practice JSM session here)

JSM 2020 Panel Flyer (PDF)
JSM online program w/panel abstract & information):

Categories: ASA Guide to P-values, Error Statistics, evidence-based policy, JSM 2020, P-values, Philosophy of Statistics, science communication, significance tests | 3 Comments

JSM 2020: P-values & “Statistical Significance”, August 6


Link: https://ww2.amstat.org/meetings/jsm/2020/onlineprogram/ActivityDetails.cfm?SessionID=219596

To register for JSM: https://ww2.amstat.org/meetings/jsm/2020/registration.cfm

Categories: JSM 2020, P-values | Leave a comment

Bad Statistics is Their Product: Fighting Fire With Fire (ii)

Mayo fights fire w/ fire

I. Doubt is Their Product is the title of a (2008) book by David Michaels, Assistant Secretary for OSHA from 2009-2017. I first mentioned it on this blog back in 2011 (“Will the Real Junk Science Please Stand Up?) The expression is from a statement by a cigarette executive (“doubt is our product”), and the book’s thesis is explained in its subtitle: How Industry’s Assault on Science Threatens Your Health. Imagine you have just picked up a book, published in 2020: Bad Statistics is Their Product. Is the author writing about how exaggerating bad statistics may serve in the interest of denying well-established risks? [Interpretation A]. Or perhaps she’s writing on how exaggerating bad statistics serves the interest of denying well-established statistical methods? [Interpretation B]. Both may result in distorting science and even in dismantling public health safeguards–especially if made the basis of evidence policies in agencies. A responsible philosopher of statistics should care. Continue reading

Categories: ASA Guide to P-values, Error Statistics, P-values, replication research, slides | 33 Comments

My paper, “P values on Trial” is out in Harvard Data Science Review


My new paper, “P Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting” is out in Harvard Data Science Review (HDSR). HDSR describes itself as a A Microscopic, Telescopic, and Kaleidoscopic View of Data Science. The editor-in-chief is Xiao-li Meng, a statistician at Harvard. He writes a short blurb on each article in his opening editorial of the issue. Continue reading

Categories: multiple testing, P-values, significance tests, Statistics | 29 Comments

The NAS fixes its (main) mistake in defining P-values!


(reasonably) satisfied

Remember when I wrote to the National Academy of Science (NAS) in September pointing out mistaken definitions of P-values in their document on Reproducibility and Replicability in Science? (see my 9/30/19 post). I’d given up on their taking any action, but yesterday I received a letter from the NAS Senior Program officer:

Dear Dr. Mayo,

I am writing to let you know that the Reproducibility and Replicability in Science report has been updated in response to the issues that you have raised.
Two footnotes, on pages 31 35 and 221, highlight the changes. The updated report is available from the following link: NEW 2020 NAS DOC

Thank you for taking the time to reach out to me and to Dr. Fineberg and letting us know about your concerns.
With kind regards and wishes of a happy 2020,
Jenny Heimberg
Jennifer Heimberg, Ph.D.
Senior Program Officer

The National Academies of Sciences, Engineering, and Medicine

Continue reading

Categories: NAS, P-values | 2 Comments

P-Value Statements and Their Unintended(?) Consequences: The June 2019 ASA President’s Corner (b)


Mayo writing to Kafadar

I never met Karen Kafadar, the 2019 President of the American Statistical Association (ASA), but the other day I wrote to her in response to a call in her extremely interesting June 2019 President’s Corner: “Statistics and Unintended Consequences“:

  • “I welcome your suggestions for how we can communicate the importance of statistical inference and the proper interpretation of p-values to our scientific partners and science journal editors in a way they will understand and appreciate and can use with confidence and comfort—before they change their policies and abandon statistics altogether.”

I only recently came across her call, and I will share my letter below. First, here are some excerpts from her June President’s Corner (her December report is due any day). Continue reading

Categories: ASA Guide to P-values, Bayesian/frequentist, P-values | 3 Comments

On Some Self-Defeating Aspects of the ASA’s (2019) Recommendations on Statistical Significance Tests (ii)


“Before we stood on the edge of the precipice, now we have taken a great step forward”

 

What’s self-defeating about pursuing statistical reforms in the manner taken by the American Statistical Association (ASA) in 2019? In case you’re not up on the latest in significance testing wars, the 2016 ASA Statement on P-Values and Statistical Significance, ASA I, arguably, was a reasonably consensual statement on the need to avoid some well-known abuses of P-values–notably if you compute P-values, ignoring selective reporting, multiple testing, or stopping when the data look good, the computed P-value will be invalid. (Principle 4, ASA I) But then Ron Wasserstein, executive director of the ASA, and co-editors, decided they weren’t happy with their own 2016 statement because it “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” altogether. In their new statement–ASA II(note)–they announced: “We take that step here….Statistically significant –don’t say it and don’t use it”.

Why do I say it is a mis-take to have taken the supposed next “great step forward”? Why do I count it as unsuccessful as a piece of statistical science policy? In what ways does it make the situation worse? Let me count the ways. The first is in this post. Others will come in following posts, until I become too disconsolate to continue.[i] Continue reading

Categories: P-values, stat wars and their casualties, statistical significance tests | 14 Comments

National Academies of Science: Please Correct Your Definitions of P-values

Mayo banging head

If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study” Reproducibility and Replicability in Science 2019, no one did. Continue reading

Categories: ASA Guide to P-values, Error Statistics, P-values | 20 Comments

Hardwicke and Ioannidis, Gelman, and Mayo: P-values: Petitions, Practice, and Perils (and a question for readers)


The October 2019 issue of the European Journal of Clinical Investigations came out today. It includes the PERSPECTIVE article by Tom Hardwicke and John Ioannidis, an invited editorial by Gelman and one by me:

Petitions in scientific argumentation: Dissecting the request to retire statistical significance, by Tom Hardwicke and John Ioannidis

When we make recommendations for scientific practice, we are (at best) acting as social scientists, by Andrew Gelman

P-value thresholds: Forfeit at your peril, by Deborah Mayo

I blogged excerpts from my preprint, and some related posts, here.

All agree to the disagreement on the statistical and metastatistical issues: Continue reading

Categories: ASA Guide to P-values, P-values, stat wars and their casualties | 16 Comments

(Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access)


A key recognition among those who write on the statistical crisis in science is that the pressure to publish attention-getting articles can incentivize researchers to produce eye-catching but inadequately scrutinized claims. We may see much the same sensationalism in broadcasting metastatistical research, especially if it takes the form of scapegoating or banning statistical significance. A lot of excitement was generated recently when Ron Wasserstein, Executive Director of the American Statistical Association (ASA), and co-editors A. Schirm and N. Lazar, updated(note) the 2016 ASA Statement on P-Values and Statistical Significance (ASA I). In their 2019 interpretation, ASA I “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned,” and in their new statement (ASA II) announced: “We take that step here….’statistically significant’ –don’t say it and don’t use it”. To herald the ASA II(note), and the special issue “Moving to a world beyond ‘p < 0.05’”, the journal Nature requisitioned a commentary from Amrhein, Greenland and McShane “Retire Statistical Significance” (AGM). With over 800 signatories, the commentary received the imposing title “Scientists rise up against significance tests”! Continue reading

Categories: ASA Guide to P-values, P-values, stat wars and their casualties | 6 Comments

Palavering about Palavering about P-values


Nathan Schachtman (who was a special invited speaker at our recent Summer Seminar in Phil Stat) put up a post on his law blog the other day (“Palavering About P-values”) on an article by a statistics professor at Stanford, Helena Kraemer. “Palavering” is an interesting word choice of Schachtman’s. Its range of meanings is relevant here [i]; in my title, I intend both, in turn. You can read Schachtman’s full post here, it begins like this:

The American Statistical Association’s most recent confused and confusing communication about statistical significance testing has given rise to great mischief in the world of science and science publishing.[ASA II 2019] Take for instance last week’s opinion piece about “Is It Time to Ban the P Value?” Please.

Admittedly, their recent statement, which I refer to as ASA II, has seemed to open the floodgates to some very zany remarks about P-values, their meaning and role in statistical testing. Continuing with Schachtman’s post: Continue reading

Categories: ASA Guide to P-values, P-values | Tags: | 12 Comments

Diary For Statistical War Correspondents on the Latest Ban on Speech

When science writers, especially “statistical war correspondents”, contact you to weigh in on some article, they may talk to you until they get something spicy, and then they may or may not include the background context. So a few writers contacted me this past week regarding this article (“Retire Statistical Significance”)–a teaser, I now suppose, to advertise the ASA collection(note) growing out of that conference “A world beyond P ≤ .05” way back in Oct 2017, where I gave a paper*. I jotted down some points, since Richard Harris from NPR needed them immediately, and I had just gotten off a plane when he emailed. He let me follow up with him, which is rare and greatly appreciated. So I streamlined the first set of points, and dropped any points he deemed technical. I sketched the third set for a couple of other journals who contacted me, who may or may not use them. Here’s Harris’ article, which includes a couple of my remarks. Continue reading

Categories: ASA Guide to P-values, P-values | 42 Comments

A letter in response to the ASA’s Statement on p-Values by Ionides, Giessing, Ritov and Page

I came across an interesting letter in response to the ASA’s Statement on p-values that I hadn’t seen before. It’s by Ionides, Giessing, Ritov and Page, and it’s very much worth reading. I make some comments below. Continue reading

Categories: ASA Guide to P-values, P-values | 7 Comments

A small amendment to Nuzzo’s tips for communicating p-values


I’ve been asked if I agree with Regina Nuzzo’s recent note on p-values [i]. I don’t want to be nit-picky, but one very small addition to Nuzzo’s helpful tips for communicating statistical significance can make it a great deal more helpful. Here’s my friendly amendment. She writes: Continue reading

Categories: P-values, science communication | 2 Comments

Statistics and the Higgs Discovery: 5-6 yr Memory Lane


I’m reblogging a few of the Higgs posts at the 6th anniversary of the 2012 discovery. (The first was in this post.) The following, was originally “Higgs Analysis and Statistical Flukes: part 2″ (from March, 2013).[1]

Some people say to me: “This kind of [severe testing] reasoning is fine for a ‘sexy science’ like high energy physics (HEP)”–as if their statistical inferences are radically different. But I maintain that this is the mode by which data are used in “uncertain” reasoning across the entire landscape of science and day-to-day learning (at least, when we’re trying to find things out)[2] Even with high level theories, the particular problems of learning from data are tackled piecemeal, in local inferences that afford error control. Granted, this statistical philosophy differs importantly from those that view the task as assigning comparative (or absolute) degrees-of-support/belief/plausibility to propositions, models, or theories.  Continue reading

Categories: Higgs, highly probable vs highly probed, P-values | 1 Comment

Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value*


An argument that assumes the very thing that was to have been argued for is guilty of begging the question; signing on to an argument whose conclusion you favor even though you cannot defend its premises is to argue unsoundly, and in bad faith. When a whirlpool of “reforms” subliminally alter  the nature and goals of a method, falling into these sins can be quite inadvertent. Start with a simple point on defining the power of a statistical test.

I. Redefine Power?

Given that power is one of the most confused concepts from Neyman-Pearson (N-P) frequentist testing, it’s troubling that in “Redefine Statistical Significance”, power gets redefined too. “Power,” we’re told, is a Bayes Factor BF “obtained by defining H1 as putting ½ probability on μ = ± m for the value of m that gives 75% power for the test of size α = 0.05. This H1 represents an effect size typical of that which is implicitly assumed by researchers during experimental design.” (material under Figure 1). Continue reading

Categories: Bayesian/frequentist, fallacy of rejection, P-values, reforming the reformers, spurious p values | 15 Comments

Erich Lehmann’s 100 Birthday: Neyman Pearson vs Fisher on P-values

Erich Lehmann 20 November 1917 – 12 September 2009

Erich Lehmann was born 100 years ago today! (20 November 1917 – 12 September 2009). Lehmann was Neyman’s first student at Berkeley (Ph.D 1942), and his framing of Neyman-Pearson (NP) methods has had an enormous influence on the way we typically view them.*


I got to know Erich in 1997, shortly after publication of EGEK (1996). One day, I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that).  He began by telling me that he was sitting in a very large room at an ASA (American Statistical Association) meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, wood table sat just one book, all alone, shiny red.

He said ” I wonder if it might be of interest to me!”  So he walked up to it….  It turned out to be my Error and the Growth of Experimental Knowledge (1996, Chicago), which he reviewed soon after[0]. (What are the chances?) Some related posts on Lehmann’s letter are here and here.

Continue reading

Categories: Fisher, P-values, phil/history of stat | 3 Comments
