Don’t throw out the error control baby with the bad statistics bathwater

Posted on March 7, 2016 by Mayo

My invited comments on the ASA Document on P-values*

The American Statistical Association is to be credited with opening up a discussion into p-values; now an examination of the foundations of other key statistical concepts is needed.

Statistical significance tests are a small part of a rich set of “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033). These may be called error statistical methods (or sampling theory). The error statistical methodology supplies what Birnbaum called the “one rock in a shifting scene” (ibid.) in statistical thinking and practice. Misinterpretations and abuses of tests, warned against by the very founders of the tools, shouldn’t be the basis for supplanting them with methods unable or less able to assess, control, and alert us to erroneous interpretations of data.

p-value. The significance test arises to test the conformity of the particular data under analysis with H₀ in some respect:

To do this we find a function t = t(y) of the data, to be called the test statistic, such that

the larger the value of t the more inconsistent are the data with H₀;

the corresponding random variable T = t(Y) has a (numerically) known probability distribution when H₀ is true.

…[We define the] p-value corresponding to any t as

p = p(t) = P(T ≥ t; H₀). (Mayo and Cox 2006, p. 81)
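To make the definition concrete, here is a minimal sketch under an illustrative assumption that is not taken from Mayo and Cox: a one-sided test of H₀: μ = μ₀ on the mean of n independent Normal observations with known σ, so that T = √n(Ȳ − μ₀)/σ is standard Normal when H₀ is true. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(sample_mean, mu0, sigma, n):
    """Return p = P(T >= t; H0), where T = sqrt(n) * (Ybar - mu0) / sigma.

    Assumes (for illustration only) Normal data with known sigma,
    so T follows N(0, 1) when H0 is true.
    """
    t = sqrt(n) * (sample_mean - mu0) / sigma   # observed test statistic
    return 1.0 - NormalDist().cdf(t)            # upper-tail area under H0

# Hypothetical data: observed mean 0.4 from n = 25 observations, sigma = 1.
p = one_sided_p_value(sample_mean=0.4, mu0=0.0, sigma=1.0, n=25)
print(f"p = {p:.4f}")
```

Larger values of t give smaller p, matching the requirement that larger t means greater inconsistency with H₀.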

Clearly, if even larger differences than t occur fairly frequently under H₀ (p-value is not small), there’s scarcely evidence of incompatibility. But even a small p-value doesn’t suffice to infer a genuine effect, let alone a scientific conclusion–as the ASA document correctly warns (Principle 3). R.A. Fisher was clear that we need not isolated significant results:

…but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1947, p. 14)

If such statistically significant effects are produced reliably, as Fisher required, they indicate a genuine effect. This is the essence of statistical falsification in science. The logic differs from inductive updating probabilities of a hypothesis, or a comparison of how much more probable H₁ makes the data than does H₀, as in likelihood ratios. Given the need to use an eclectic toolbox in statistics, it’s important to avoid expecting an agreement on numbers from methods evaluating different things. Hence, it’s incorrect to claim a p-value is “invalid” for not matching a posterior probability based on one or another prior distribution (whether subjective, empirical, or one of the many conventional measures).

Effect sizes. Acknowledging Principle 5, tests should be accompanied by interpretive tools that avoid the fallacies of rejection and non-rejection. These correctives can be articulated in either Fisherian or Neyman-Pearson terms (Mayo and Cox 2006, Mayo and Spanos 2006). For an example of the former, looking at the p-value distribution under various discrepancies from H₀: μ = μ₀ allows inferring those that are well or poorly indicated. If you very probably would have observed a more impressive (smaller) p-value than you did, if μ > μ₁ (where μ₁ = μ₀ + γ), then the data are good evidence that μ < μ₁. This is akin to confidence intervals (which are dual to tests) but we get around their shortcomings: We do not fix a single confidence level, and the evidential warrants for different points in any interval are distinguished. The same reasoning allows ruling out discrepancies when p-values aren’t small. This is more meaningful than power analysis, or taking non-significant results as uninformative. Most importantly, we obtain an evidential use of error probabilities: to assess how well or severely tested claims are. Allegations that frequentist measures, including p-values, must be misinterpreted to be evidentially relevant are scotched.
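The reasoning about discrepancies can be sketched numerically. The following is only an illustration under assumed conditions (a one-sided test on a Normal mean with known σ, hypothetical numbers and function name); it computes, for several values of μ₁, the probability of a more impressive result than the one observed were μ equal to μ₁. That probability is smallest at the boundary μ = μ₁ among all μ ≥ μ₁, so it is the relevant figure for the claim μ < μ₁.

```python
from math import sqrt
from statistics import NormalDist

def warrant_for_mu_less_than(mu1, sample_mean, sigma, n):
    """P(sample mean exceeds the observed one; mu = mu1).

    Equivalently, the probability of a smaller p-value than the one
    observed if the true mean were mu1. A high value indicates the
    data are good evidence that mu < mu1.
    Illustrative assumption: Normal data with known sigma.
    """
    z = sqrt(n) * (sample_mean - mu1) / sigma
    return 1.0 - NormalDist().cdf(z)

# Hypothetical data: observed mean 0.4, sigma = 1, n = 25 (testing mu0 = 0).
for mu1 in (0.2, 0.4, 0.6, 0.8):
    value = warrant_for_mu_less_than(mu1, sample_mean=0.4, sigma=1.0, n=25)
    print(f"mu < {mu1}: {value:.3f}")
```

With these numbers the claim μ < 0.8 is well indicated while μ < 0.2 is poorly indicated, and no single confidence level had to be fixed in advance.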

Biasing selection effects. We often hear it’s too easy to obtain small p-values, yet replication attempts find it difficult to get small p-values with preregistered results. This shows the problem isn’t p-values but failing to adjust them for cherry picking, multiple testing, post-data subgroups and other biasing selection effects. The ASA correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4). The actual probability of erroneously finding significance with this gambit is not low, but high, so a reported small p-value is invalid. However, the same flexibility can occur with likelihood ratios, Bayes factors, and Bayesian updating, with one big difference: the direct grounds to criticize inferences as flouting error statistical control are lost (unless they are supplemented with principles that are not now standard). The reason is that they condition on the actual data, whereas error probabilities take into account other outcomes that could have occurred but did not.
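A small simulation can illustrate why the actual error probability is high under this gambit. The sketch below assumes, purely for illustration, that a researcher runs k independent analyses with every null hypothesis true and reports only the smallest p-value; the function name, k = 20, and the 0.05 cutoff are hypothetical choices.

```python
import random
from statistics import NormalDist

def prob_some_significant(k, alpha=0.05, trials=10_000, seed=0):
    """Estimate P(at least one of k independent null tests has p <= alpha)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(trials):
        # Each analysis yields a z statistic that is N(0, 1) because H0 is true.
        p_values = [1.0 - std_normal.cdf(rng.gauss(0.0, 1.0)) for _ in range(k)]
        if min(p_values) <= alpha:   # report only the most impressive result
            hits += 1
    return hits / trials

k, alpha = 20, 0.05
simulated = prob_some_significant(k, alpha)
analytic = 1.0 - (1.0 - alpha) ** k   # exact value for independent tests
print(f"simulated: {simulated:.3f}, analytic: {analytic:.3f}")
```

The nominal 0.05 is what gets reported, but the probability of finding at least one such result by chance alone is roughly 0.64 in this setup.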

The introduction of prior probabilities, which may also be data dependent, offers further leeway in determining if there has even been replication failure. Notice the problem with biasing selection effects isn’t about long-run error rates; it’s about being unable to say that the case at hand has done a good job of avoiding misinterpretations.

Model validation. Many of the “other approaches” rely on statistical models that require “diagnostic checks and tests of fit which, I will argue, require frequentist theory significance tests for their formal justification” (Box 1983, p. 57), leading Box to advocate ecumenism. Echoes of Box may be found among holders of different statistical philosophies. “What we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data…” (Gelman and Shalizi, p. 20).

We should oust recipe-like uses of p-values that have been long lampooned, but without understanding their valuable (if limited) roles, there’s a danger of blithely substituting “alternative measures of evidence” that throw out the error control baby with the bad statistics bathwater.

*I was a “philosophical observer” at one of the intriguing P-value ‘pow wows’, and was not involved in the writing of the document, except for some proposed changes. I thank Ron Wasserstein for inviting me.

[1] I thank Aris Spanos for very useful comments on earlier drafts.

REFERENCES

Birnbaum, A. (1970), “Statistical Methods in Scientific Inference (letter to the Editor),” Nature 225(5237): 1033.

Box, G. (1983), “An Apology for Ecumenism in Statistics,” in Scientific Inference, Data Analysis, and Robustness, eds. G. E. P. Box, T. Leonard, and D. F. J. Wu, New York: Academic Press, pp. 51-84.

Cox, D. and Hinkley, D. (1974), Theoretical Statistics, London: Chapman and Hall.

Gelman, A. and Shalizi, C. (2013), “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder,” British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76–80.

Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes-Monograph series, Institute of Mathematical Statistics (IMS), Vol. 49: 77–97.

Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction,” British Journal for the Philosophy of Science 57(2): 323–57.

My comment is #17 under the supplementary materials:

http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108

Categories: Error Statistics, P-values, science communication, Statistics | 21 Comments

21 thoughts on “Don’t throw out the error control baby with the bad statistics bathwater”

March 7, 2016

Mayo

Readers who want to weigh in on one of the ASA forums (assuming you’re a member):

To discuss the ASA’s statement on p-value, log in to ASA Connect http://community.amstat.org/communities/community-home/digestviewer?communitykey=6b2d607a-e31f-4f19-8357-020a8631b999&tab=digestviewer …

or comment on Amstat News http://magazine.amstat.org/blog/2016/03/07/pvalue-mar16/ …

I invite readers who want to share their thoughts in this more private space to comment here.

Reply
March 7, 2016

Mayo

My comment is #17 in the group under “supplementary materials”.
http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108

Here are some quick thoughts on select comments:
Benjamin and Berger say the inability to find a suitable replacement for p-values stems from their being more complicated and not acceptable to both frequentists and Bayesians. Berger has come up with many ingenious alternatives and “reconciliations” which he himself seems to give up on after a while. But he’s so influential that a large part of the literature is still discussing his last attempt. General warning: beware whenever you hear from Berger that measures are “fully frequentist”. He has his own notion of frequentist error probabilities which are actually posterior probabilities (you can search this blog for several posts on the matter, with links to published papers).

Along with Senn, I most agree with Benjamini’s statement.
I agree with him that this document expresses a negative attitude toward p-values, and doesn’t clarify their meaning or correct role. I agree further that a much more positive way would have included an assessment of some of the other approaches which get a free ride. (Of course CIs are dual to tests, but they too are open to fallacies and dichotomous uses). Still, the very fact that Benjamini (and some others) were included in the process prevented out and out p-bashing. It’s kind of scary to think that the composition of a committee could so influence edicts from on high.

Ioannidis’ comment is extremely disappointing. On the one hand I completely agree that the main problem in many fields is that people test silly hypotheses by silly means. (See my discussions on the replication crisis in psych). However, he gives no clue whatsoever as to how appealing to Bayesian methods could help rather than hurt here. The researchers in the most questionable areas of social psych have amassed huge literatures on the effects they’re studying, and they really, really believe their theories. So the high prior would only countenance what is at least open to challenge by criticizing the error probabilities of the method. The Bayesian can only say, “I just don’t believe it”. Ioannidis’ standpoint strikes me as incoherent, frankly. I need to speak with him (seriously). (Meta-research is certainly not free of philosophy or politics.)

Berry, who I believe is a subjective Bayesian, favors no adjustments for selection but merely reporting what was done, as if people will be able to make heads or tails of the report through all the many effects of multiple testing and selection effects.

Greenland is concerned with type 2 errors, and focuses much of his comment on Neyman, which is interesting since the ASA document is so stuck on Fisherian-style tests. He opposes the kind of spiked prior that is at the heart of the constant drum beat that p-values overstate the evidence, and I entirely agree with him. Of course his position on this is opposed by others who live by the spike.

Valen Johnson is one of those who live by the spike. He’s of the group claiming p-values exaggerate the evidence, although, ironically, he’s prepared to infer alternatives much further from the null than would be countenanced in N-P testing!

Senn’s comment is the only one that brings out the problem of using the spiked prior in the current fad of viewing the statistical appraisal of a hypothesis as a cog in a giant wheel of diagnostic testing. (I’m so glad because I didn’t have space to squeeze it in.) I don’t know if Ioannidis really meant for individual hypotheses to be assessed according to the “positive predictive value” based on “prevalence” of true nulls (which is assumed to be .5, never mind that many claim that point nulls are invariably false), or whether he was just showing how a host of false findings could be in circulation if you reject on the basis of a single, just significant result, resort to cherry-picking, and violate model assumptions. Plus publication bias. But Colquhoun really believes appraisal should be done this way, that it tells you the probability you’ll be shown to be a fool, never mind the utter arbitrariness of determining prevalence.

It would have been interesting for authors to identify their “philosophy” of statistics, along with their comments.

Reply
March 7, 2016

Mayo

A little discussion with Lew over at Gelman’s blog, also revolving around the p-value doc, that I’m bringing over here. See what you think:

Mayo says:
March 7, 2016 at 8:13 pm
Michael:
I have no clue what you can mean by saying N-P repudiate “a probabilistic interpretation of a p-value from a significance test”? Sorry, what? Are you saying Fisher favors a probabilistic interpretation of a p-value and N-P do not? Truly this makes no sense, perhaps it’s a misprint. Are you talking fiducial (for Fisher)? As for N-P, N-P spokesman Lehmann calls the p-value the “significance probability”–pretty clearly still a probability. So I’m perplexed.

Reply to this comment
Michael Lew says:
March 7, 2016 at 8:36 pm
Mayo, in the first edition of his book Lehmann makes no mention of P-values by name. (It is as if they are Voldemort.) The nearest he comes is this [bits in square braces are mine]:

“In applications there is usually a nested family of rejection regions, corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis [i.e. the null hypothesis] is accepted or rejected at the given significance level [i.e. at the predetermined alpha level], but also to determine the smallest significance level $\hat{\alpha}=\hat{\alpha}(x)$, the \textit{critical level}, at which the hypothesis would be rejected for the given observation. This number gives an idea of how strongly the data contradict (or support) the hypothesis, and enables others to reach a verdict on the significance level of their choice.”
—Lehmann, Testing Statistical Hypotheses, p. 62.

(In the third edition, that paragraph is modified to include the phrase P-value.)

Where does it say that the P-value should be interpreted as a probability? How can that paragraph be read as anything other than an endorsement of an acceptance procedure?
———————————————–

Still perplexed. Michael seems to be saying that in order for a p-value to be a probability it must be a posterior probability? Or maybe a fiducial probability? I don’t know what he can mean.

Lew was part of the P-value pow-wow, by the way, as you can see in the list. I think the view he’s pushing is that a p-value is merely a test statistic that needn’t have any error probabilistic meaning, which entails of course that biasing selection effects don’t matter to it. But they always mattered to Fisher, and Cox works hard to delineate the cases requiring an adjustment in order that p-values perform their intended job. We also take it up in Mayo and Cox (2006). And principle 4 of the p-value doc asserts that p-values are invalidated by selection effects, cherry-picking etc. So his conception appears at odds with the p-value document as well as established practice. But I remain mystified as to why he wants to also deny a significance level is a probability. (Lehmann often called the attained significance levels “significance probabilities”, which might not be the best term.) In his texts and in my many discussions with him he always regarded the attained p-value as appropriate to report; he was opposed to a strict cut-off, which he attributes to Fisher. The attained p-value is, for Lehmann, the type 1 error probability associated with rejecting with the result as or further than the one observed. If you search Lehmann on this blog you’ll find references such as https://errorstatistics.com/2015/11/20/erich-lehmann-neyman-pearson-vs-fisher-on-p-values/.

Reply

March 7, 2016

Michael Lew

Mayo, perhaps you could put my commentary on the ASA statement. I think that it will clarify some of the issues. I will send it to you by email.

Reply

March 10, 2016

Mayo

a link for you from a tweet of mine:

Look at what my fortune cookie says: pic.twitter.com/8Ikts5Zkz0

— Deborah G. Mayo (@learnfromerror) March 11, 2016

@UglyResearch That's why,tiny prob of x given "chance"/max prob given 'Hunan Gardens deliberately gave me the stat cookie',LR,is a bad acct!

— Deborah G. Mayo (@learnfromerror) March 11, 2016

Reply

March 7, 2016

Michael Lew

Mayo, I am keen that people understand that P-values can be used as indices of evidence in the data about the null hypothesised value of the parameter of interest of the statistical model employed. In that role, the numerical value of the P-value, which is the probability of data as extreme or more extreme, does not contain the whole meaning of that evidence. Evidence is not expressible as any single probability, as its nature is comparative. The evidence favours value A of the parameter over value B, but value C is more favoured than either A or B.

The P-value in Lehmann’s (later) scheme is a probability of error that would have been entailed if that value had been chosen as alpha in advance of seeing the data. It wasn’t chosen in advance and so it isn’t.

Reply

March 7, 2016

Mayo

Michael: you just keep starting out with what an error statistician regards as a false assumption: that there’s no such thing as asking whether or not x is evidence for H, but only if x is comparatively better for H against J, where evidence is given by likelihood. We are also robbed from asking about how much better tested H is by x than by y.
Finally, the error probability always refers to one and the same probability regardless of whether it’s Mary’s life long requirement, or Isaac, merely asserting that if Ho adequately described the data generating process, I’d expect a less significant result than I got with prob (1 – p), i.e., Pr(P>p;Ho) = (1-p), and then USING this information to infer something about Ho. If you’d almost always get a less significant result were Ho adequate, and you keep bringing about such impressive p-values, then this is indicative of a genuine effect. It’s a weak inference to a genuine effect, but it is the logic of all statistical falsification in science (which is to say, all relevant falsification in science). Comparative likelihoods don’t teach me about genuine effects (I can always find a better fitting rival than Ho, even if Ho is correct.)
I don’t see why you don’t understand that one needn’t have specified a p in advance in order to use this reasoning to learn about this aspect of the process that produced the data. Time has nothing to do with it. That’s why we may never get further, because I find it utterly baffling.
Never mind p-values, consider 4 different scales with known properties that all show weight gain, even though no such gain shows up with objects of known weight. I infer evidence I’ve gained weight. Distinct scales, that work reliably, are not all conspiring to trick me just when I don’t know the weight of an object, while working fine on objects of known weight. I have highly probed and passed the claim that there’s evidence of weight gain. To reject such arguments from coincidence is to reject all learning about the world, even though it’s strictly fallible.

Reply

March 8, 2016

Michael Lew

“what an error statistician regards as a false assumption”… WTF? There may be no such thing as asking whether or not x is evidence for a hypothesis _within an error-decision framework_ but that does most definitely NOT mean that such a question cannot be asked within different frameworks. That is exactly my point.

You may like to infer something about the null hypothesis, but I think that it is often much more useful to infer something about the property represented by the parameter that H0 is just one particular value of. Try not to think of the null hypothesis as being a special hypothesis, it is just a hypothesis that the parameter takes one particular value among many possible values.

You write “I can always find a better fitting rival than Ho, even if Ho is correct” again, but it is not relevant to this discussion, and it is quite misleading. Yes, the best supported value of the parameter is usually not H0 even when H0 is true, but there is no compulsion for you to conclude that the best supported value is the true value. If it is not much better supported than H0 when H0 is actually a special value of the parameter, then go with H0. That would be a rational thing to do, and it is not something that the law of likelihood or the likelihood principle forbid. Furthermore, the idea that there is _always_ a better supported value of the parameter than the true value is false, as I demonstrated in my unpublished paper that you have read.

“I don’t see why you don’t understand”. I certainly do not agree with you, but there may be reasons for disagreeing other than a failure to understand. Yes, of course you can _learn_ from a P-value even if it is not set pre-data as the threshold for a decision. That is because that P-value can be interpreted in evidential terms. I am critical of the notion that such a P-value has a useful error probability meaning.

Reply

March 7, 2016

Mayo

I really like Benjamini’s comment, and I want to place the majority of his comment here for your consideration. He was at the pow-wow the day I observed, and it was the first time I’d met him. (The comments are under supplementary materials here: http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108)

It’s not the p-values’ fault
Yoav Benjamini Tel Aviv University

Abstract. I argue that ASA board statement about the p-values may be read as discouraging the use of p-values, while the other approaches offered there might be misused in much the same way. In particular, ignoring the effect of selection on statistical inferences is common yet potentially very harmful to the replicability of research results. Keywords: selective inference, industrialized science

When I was invited to participate in ASA committee, my initial response was that it would be better for the committee to draft a statement about the appropriate use of statistical tools for addressing the crisis of reproducibility and replicability (R&R) in science. Unfortunately, in response to outcries about the role of Statistics, which focused on the perceived role of the widely used p-values, the ASA board fell into the trap of formulating a statement about the p-values. The well-phrased statement demonstrates our mistake in singling out the p-value: posing the p-value as a culprit, rather than the way most statistical tools are used in the new world of industrialized science.

Admittedly, most statisticians reading this statement will agree with most of its principles (Bayesians may not agree to principle 1, frequentists will have difficulties understanding principle 6), but all principles stated are only about p-values and statistical significance. The result is a statement that will be read by our target audience as expressing very negative ASA attitude towards the p-value.

…. Non-statistical scientists, editors, policy makers or judges, who read these principles will conclude that the p-value is indeed a very risky statistical tool, as advertised by its opponents. Avoiding its use and discouraging its use by others is just a matter of common sense. This will be the case especially since the ASA statement offers Other Approaches: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches.”

Yet all of these other approaches, as well as most statistical tools, may suffer from many of the same problems as the p-values do. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold (P3)? Does either measure the size of an effect (P5)? Isn’t our best effect size estimator useless as a single measure if not supported by a statement about its uncertainty? How can we decide about the sample size needed for a clinical trial – however analyzed – if we do not set a specific bright-line decision rule? Finally, 95% confidence intervals or credence intervals (both sharing the limitations in P2) offer no protection against selection when only those that do not cover 0, are selected into the abstract (P4).

What made the p-value so useful and successful in Science throughout the 20th century, despite the misconceptions so well described in the statement? In some sense it offers a first line of defense against being fooled by randomness, separating signal from noise, because the models it requires are simpler than any other statistical tool needs. Likelihood ratios, effect size estimates, confidence intervals, and Bayesian methods all rely on assumed models over a wider range of situations, not merely under the tested null; Bayesian tools need further modeling, in the form of priors and hierarchical structures. Most important, the model needed to calculate the p-value can be guaranteed to hold under appropriately designed and executed randomized experiments. The p-value is a very valuable tool, but when possible it should be complemented – not replaced – by confidence intervals and effect size estimates. The end of a 95% confidence interval that extends towards 0 indicates by how much the difference can be separated from 0 (in a statistically significant way at level 5%…). The mean difference, when supported by an assessment of uncertainty, is again useful. Disappointingly, in some areas of science these methods are grossly underutilized.

Sometimes, especially when using emerging new scientific technologies, the p-value is the only way to quantify uncertainty, and can be mapped and compared across conditions (e.g. functional MRI, Gene Expression, Genome-Wide Association Studies). It is recognized that merely “full reporting and transparency” (Principle 4) is not enough, as selection is unavoidable in these large problems. Selection takes many forms: selection by a table, selection into the abstract, selection by highlighting in the discussion, selection into a model, or selection by a figure. Further statistical methods must be used to address the impact of selective inference, otherwise the properties each method has on the average for a single parameter (level, coverage or unbiased) will not hold even on the average over the selected parameters. Therefore, in those same areas, the p-value bright-line is not set at the traditional 5% level. Methods for adaptively setting it to directly control a variety of false discovery rates or other error rates are commonly used. More generally, addressing the effect of selection on inference has been a very active research area, resulting in new strategies and sophisticated tools for testing, confidence intervals, and effect-size estimates, in different setups. It deserves a separate review.

The transition in large complex problems illustrates the process occurring throughout science: the industrialization of the scientific process at the turn of the century. Experimentation is done by high throughput industrial processes and their outcomes are analyzed automatically, resulting in a large number of inferences to select from. With the availability of ever-larger databases and the ease of computations, other areas of science are undergoing similar industrialization processes, yet are slow to realize these changes. For example, the estimated number of reported inferences in the 100 studies included in the “reproducibility project” in Experimental Psychology (Open Science Collaboration, 2015) range from 5 to 730, with an average of 77 (± 10) per study. We currently study the actual selection process in these complex studies (rather than merely counting) but it is enough to note that only 11 studies included any partial effort to address selection. Facing such ignorance I prefer to eyeball a set of p-values to assess the effect of selection rather than view a set of confidence intervals.

In summary, when discussing the impact of statistical practices on R&R, the p-value should not be singled out nor its use discouraged: it’s more likely the fault of selection, not the p-values’.

Reply

March 8, 2016

omaclaren

“the models it requires are simpler than any other statistical tool needs. Likelihood ratios, effect size estimates, confidence intervals, and Bayesian methods all rely on assumed models over a wider range of situations, not merely under the tested null; Bayesian tools need further modeling, in the form of priors and hierarchical structures. Most important, the model needed to calculate the p-value can be guaranteed to hold under appropriately designed and executed randomized experiments.”

Personally I think models come first (Senn also tweeted something about this recently wrt randomisation) and so would prefer measures that require more explicit modelling. More modelling gives more details gives more explicit assumptions to check.

Reply

March 8, 2016

Mayo

Om: If your data’s a mess, models won’t cure it. It’s like Fisher’s warning not to call the statistician in afterwards, unless you want a mere autopsy. Having said that, some fields are intrinsically observational and have to live with that. All Spanos’ model checking is for economics.

Reply

March 8, 2016

omaclaren

If your data are a mess then your model should reflect that. I didn’t mean you could always get good conclusions, just that a model is a good way of stating your assumptions up front.

Reply

March 10, 2016

john byrd

After having been through all of the commentary and the primary document, I find it very surprising that these principles are not broadened to reference all use of statistics, not just p-values. I suppose that organization felt compelled to fixate on p-values. They will be fixating on abuse of priors in Bayesian statistics inevitably, and ironically, given the abuse of spiked priors in some of the commentaries. While there are clearly misuses of ‘p’, I think the ASA has done a disservice to the lay public by creating the impression that there is a special problem with significance tests in general, not their incorrect application/interpretation.

Reply

March 10, 2016

Mayo

John: I quite agree, and Benjamini says much the same. Some have a feeling of political pressure. I doubt they’ll issue edicts on abuses of priors because, well, if it’s your belief then it’s hard to fault. You know the expression: Bayesianism means never having to say you’re wrong.

Reply

March 10, 2016

john byrd

I understand, but the problem will arise when conclusions reached through use of poorly justified (or calibrated) priors cannot be replicated. It will only deepen the distrust if researchers say that their beliefs are different, and that is why we cannot replicate.

Another aspect of the ASA statement that surprises me is that some statisticians seem surprised that p-values should vary in repeated runs of an experiment. Is it a revelation that the value of “p” should have a distribution in repeated runs? I do not think it was a revelation to Fisher or Neyman or Pearson. Mike Lew’s paper with the simulations showed how the likelihood and p vary under repeated sampling under controlled settings and how they relate to one another. I found that very helpful to see, but not a surprise.
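The point that “p” has a distribution in repeated runs can be checked with a short simulation. This is only a sketch under assumed conditions (a one-sided test on a Normal mean with known σ = 1 and n = 25, hypothetical function names, and it is not the simulation from Lew’s paper): under H₀ the fraction of p-values at or below any cutoff should be close to that cutoff, while under a genuine effect small p-values become far more frequent yet still vary from run to run.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_p_values(mu, mu0=0.0, sigma=1.0, n=25, reps=10_000, seed=1):
    """One-sided p-values from repeated runs with true mean mu."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = sigma / sqrt(n)                      # standard error of the mean
    p_values = []
    for _ in range(reps):
        sample_mean = rng.gauss(mu, se)       # draw from the sampling distribution
        t = (sample_mean - mu0) / se          # test statistic for H0: mu = mu0
        p_values.append(1.0 - std_normal.cdf(t))
    return p_values

def fraction_at_most(p_values, cutoff):
    """Share of simulated runs with p <= cutoff."""
    return sum(p <= cutoff for p in p_values) / len(p_values)

under_null = simulate_p_values(mu=0.0)        # H0 true
under_effect = simulate_p_values(mu=0.4)      # hypothetical genuine effect

for cutoff in (0.05, 0.25, 0.50):
    print(cutoff,
          round(fraction_at_most(under_null, cutoff), 3),
          round(fraction_at_most(under_effect, cutoff), 3))
```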

Reply

March 10, 2016

Mayo

John: right, if there wasn’t variability, there’d be no need for statistics. Sometimes I wonder if people forget that statistics is about variable phenomena. The trick is to rein it in so that it becomes a more familiar kind of variability that we can control.

Reply

March 11, 2016

Mayo

My conversation over the past few days with Norm Matloff, Mad (Data) Scientist on the ASA p-value doc

https://matloff.wordpress.com/2016/03/10/p-values-the-continuing-saga/

matloff
Mad (Data) Scientist
I’ll join you on Elba, Deborah, but I will have to live on the opposite side of the island. 🙂

Mayo
You’ll miss the pow wows, and for what? An utterly arbitrary distinction.

Matloff
I’ll commute. 🙂

Mayo
I’d want you nearby because of your special abilities. Look, whatever you don’t like about tests can be reformulated in terms of estimation, only better.
See if you can get this link (about the consequences of only using CIs):

Matloff:
But, Deborah, that so-called equivalence is exactly the kind of theoretical analysis that I believe has led the profession so far astray.
I’m not sure how you intend me to get that link.

Mayo:
But, Norm, how did this mathematical duality lead the profession so far astray? What’s it even got to do with it?

Matloff
An example is one already mentioned, the checking of “whether the CI contains 0.” This reduces the CI to a hypothesis test, and thus defeats the purpose of the CI, tossing out the additional information it provides. In most settings, a narrow CI near 0 has the same practical meaning as one that contains 0, but the math stat people treat them as radically different.
As I’ve said, the math can be quite elegant, and my own background is in pure, highly abstract math, but the theory just isn’t consistent with the real world.
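To make the “contains 0” point concrete, here is a minimal sketch with made-up numbers. The estimates, standard errors, and function names are hypothetical, and a Normal-approximation interval is assumed. It shows a wide interval that contains 0 next to a narrow interval that excludes 0 but sits very close to it.

```python
from statistics import NormalDist


def z_interval(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval for an estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (estimate - z * std_error, estimate + z * std_error)


def contains_zero(ci):
    """The yes/no check that reduces an interval to a test."""
    lower, upper = ci
    return lower <= 0.0 <= upper


# Hypothetical numbers chosen only for illustration.
wide = z_interval(estimate=0.5, std_error=0.4)       # imprecise estimate
narrow = z_interval(estimate=0.02, std_error=0.005)  # precise, tiny effect

for label, ci in (("wide", wide), ("narrow", narrow)):
    print(label, ci, "contains 0:", contains_zero(ci))
```

The yes/no check separates the two cases, yet only the interval endpoints show that the second one rules out anything but a very small effect, while the first is compatible with effects well above 1.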

Mayo
But I never advocated such a use of CIs. You seem to be arguing that in order to appreciate the duality of a CORRECT use of tests and CIs you’re led to an INCORRECT use of CIs.
I’m bringing this conversation over to my blog. You’ll have to take the ferry, or move over here where you belong.

Matloff:
Aren’t you at least going to give me a token for the ferry? 🙂
I should hasten to point out that I never said (or thought) that you check CIs for containing 0. Actually, I was referring to mathematicians, not philosophers. 🙂
By the way, in light of your Frequentists in Exile theme (one of the cleverest names I’ve ever seen in academia), I contend that one of the major (though not necessarily conscious) appeals that Bayesian methods have for followers of the philosophy is the mathematical elegance.

Mayo:
6 months of tokens (then my book should be out and you’ll be convinced). So you’ve got to come over here to talk.

Modus Ponens is mathematically elegant as are many other logical truths, yet it/they scarcely encompass statistical inference in science which is all about learning new things. Bayes’s theorem is a deductively valid argument, given the definitions, and what comes out is never beyond what goes in.

Reply
March 13, 2016

Mayo

Here’s a link to Gelman’s comment:

7_gelman-1.pdf

I appreciate his recognition that anything that can invalidate a p-value (or other error probability) is in principle relevant to scrutinizing a study. This goes to show the importance of techniques that can pick up on such biasing selection effects.

But he’s really disturbed about moving from rejecting a null to purporting to have evidence for a research claim. That criticism would hold even if the p-value was kosher. This is the point that statistical significance doesn’t warrant a substantive research claim. The reason is that there are many ways the research claim may be false that haven’t been probed in the least by dint of finding stat sig, even if we imagine that was done correctly. N-P tests deal with this by insisting the alternative be a statistical one such that the null and alternative exhaust the space (for the given question). I am the first to say this account needs supplements, but even Neyman hinted at using power analysis post-data in order to interpret non-statistically significant results. My papers delineate 4 main places. I modify this using a data-dependent notion of severity.

But my real question is this: why are we letting pseudoscience and QRPs be the standard for what statistical methods are legitimate? It’s as if some people are saying, well we have to do shoddy science, instead of ousting certain fields as pseudoscientific or questionable science. I quite agree with Gelman’s call to worry about measurement.

So, first off, no leaps from statistical h: some stat effect, to substantive H: a causal explanation, unless it is shown that the ways the inference to H may be flawed are decently ruled out.
Second, those areas that fail to test the validity of their measurements as relevant for the research claim inferred, and fail even to make progress in improving measurements, should be declared pseudosciences, fringe sciences. It’s not a matter of living with uncertainty. It’s a matter of living with the fact that there simply aren’t regularities in many of the areas where some people seek them.
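The data-dependent severity assessment mentioned above can be sketched for the simplest case. This is a hedged illustration, not a full statement of the account: it assumes the one-sided Normal test of H₀: μ ≤ μ₀ against μ > μ₀ with known σ, where the severity for the claim μ > μ₁ is taken to be P(X̄ ≤ x̄_obs; μ = μ₁). The numbers and function names are hypothetical.

```python
from statistics import NormalDist


def severity_mu_greater(xbar_obs, mu1, sigma, n):
    """SEV(mu > mu1): probability of a sample mean no larger than the
    one observed, computed under mu = mu1 (one-sided Normal test,
    sigma assumed known)."""
    std_error = sigma / n ** 0.5
    return NormalDist().cdf((xbar_obs - mu1) / std_error)


def severity_mu_at_most(xbar_obs, mu1, sigma, n):
    """SEV(mu <= mu1), the complementary assessment used when
    interpreting a non-significant result."""
    return 1.0 - severity_mu_greater(xbar_obs, mu1, sigma, n)


# Hypothetical data: sigma = 1, n = 100, observed mean 0.2 (z = 2 against mu0 = 0).
for mu1 in (0.0, 0.1, 0.2, 0.3):
    sev = severity_mu_greater(xbar_obs=0.2, mu1=mu1, sigma=1.0, n=100)
    print(f"SEV(mu > {mu1}) = {sev:.3f}")
```

With these assumed numbers the same statistically significant result warrants μ > 0 with high severity, but gives poor grounds for μ > 0.3. The assessment depends on the observed mean, not only on whether the cutoff was crossed.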

Reply

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.
