This was initially posted as slides from our joint Spring 2014 seminar: “Talking Back to the Critics Using Error Statistics”. (You can enlarge them.) Related reading is Mayo and Spanos (2011)

Possible typo: If this is supposed to be Hawaiian, it should be “Kalikimaka” not “Chrismukkah”.

See here:

https://en.wikipedia.org/wiki/Hawaiian_language#Phonology

and here:

https://en.wikipedia.org/wiki/Mele_Kalikimaka

President Obama pronounced it very well as he closed his recent press conference. Of course, he spent much of his youth in Hawaii and knew how to pronounce it.

My cousin happened to call me this evening (out of the blue) and knew exactly what I meant when I wished him “mele kalikimaka”. I did not know that he had spent time when young in Hawaii, so that was a bit of fun.

Unless, of course, you have another language in mind.

BR: Why would I speak Hawaiian? None of this Father Damien luau volcanic ash stuff.

He was made a saint because of his work with people with leprosy (for those unaware of Hawaiian history).

Chrismukkah is pretty obviously (or so I would have thought) a portmanteau of Christmas and Hanukkah.

Right, I assumed the previous commentator was just making a joke, so I continued it.

This is the first time I have seen a response to the problem of reversing the null and alternative. But what if the data are not in the middle of the two simple hypotheses, but, say, closer to the null?

e.berk: makes no difference, nice to hear from you.

What’s wrong with “all models are wrong”? I know that you have argued against this statement for quite some time already, but I am not yet convinced. Granted, models can be “adequate” and useful, and assumptions can be tested. However, putting aside the fact that I have still hardly seen any severity calculations for misspecification tests, even if they existed, one could never literally confirm the model, only, in the best case, some neighbourhood of the model, which I think you know. I think we agree that all models are idealisations, and probably we also agree that no two things in the world are “truly independent” etc. – so “all models are wrong”, aren’t they? One can argue that this is no big deal for modelling, one can argue that the statement “all models are wrong” is often used in an unhelpful way, but how can you argue that the statement is false?

Christian: The trouble is that, like many truisms, “all models are wrong” is taken to have real substantive implications, whereas, like all truisms, it is utterly uninformative. Here, I have actually heard people say that since all models are wrong there is no point to testing them; they are already falsified. Not only is this absurd (what we really want is to know in what ways they fail); you’d also have to say that both a theory and its rivals are falsified. Another horrendous alleged implication I have heard is that we should not care about power since all models/statistical hypotheses are false. Some even say the error probabilities in rejecting hypotheses are 0, since it’s always correct to declare them false. The bottom line is that this truism, often declared with a pseudo-philosophical flourish, is taken to have implications which it doesn’t have at all. It’s somewhat akin to “everything is subjective” because anything humans do is strictly speaking done by a “subject”. Any proposition expressed in language is strictly false because language, like models, invariably represents partial and limited facets.

By the way, denying the implications people attach to this truism isn’t the same as declaring one can prove claims true; we are all fallibilists, etc. etc.

Aris Spanos, in his comment, has brought out other wrong-headed meanings often associated with this truism.

As I often say, I am interested in human knowledge and how we can get more of it faster. Acknowledging the fact that human knowledge is incomplete and contains gaps is not the slightest bit helpful in advancing this.

Merry Xmas

Christian: the slogan “all models are wrong” is not only unhelpful as a sophism, but it is also misleading because it stems from a serious confusion between two different kinds of falseness.

The first has to do with a substantive model being an approximation of the reality it aims to shed light on (describe, explain, predict), which we can call substantive inadequacy.

The second is statistical inadequacy, which has to do with the falseness of the probabilistic assumptions one imposes on the data; often indirectly via error terms.

This distinction stems from the fact that in empirical modeling behind every substantive model there is a statistical model which represents solely the probabilistic assumptions one imposes on the data Z₀, or more accurately on the stochastic process {Z_{t},t∈N} underlying data Z₀.

When any of these probabilistic assumptions are invalid for the particular data, the nominal error probabilities are likely to be very different from the actual ones, if one is a frequentist, and the likelihood function is erroneous if one is a Bayesian or a likelihoodist.

Statistical adequacy pertains exclusively to whether the statistical model “adequately accounts” for the chance regularities in Z₀, or equivalently, whether the particular statistical model could have generated Z₀. “Adequately accounts” does not refer to these assumptions being exactly right! Indeed, anybody who invokes exactitude in statistical modeling and inference is majorly confused. How can one make “adequately accounts” more precise? There are a number of ways one can do that. The one I strongly prefer is in terms of the discrepancies between the relevant nominal and actual error probabilities induced by different departures from the statistical model assumptions. Are these discrepancies large enough to lead an inference astray? I have published several papers where I explain and illustrate the above remarks for those who want to see the details.
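The nominal/actual discrepancy can be illustrated with a small simulation (my own sketch, not taken from Spanos’s papers): run a nominal 5% t-test of μ = 0 on data whose independence assumption fails (here, AR(1) dependence with the true mean still 0) and watch the actual Type I error rate depart from the nominal one.

```python
import numpy as np

def simulate_ar1(rho, n, reps, rng):
    """Generate `reps` series of length `n` from an AR(1) process with
    standard-normal innovations; rho = 0 gives i.i.d. N(0, 1) data."""
    e = rng.standard_normal((reps, n))
    x = np.empty((reps, n))
    x[:, 0] = e[:, 0]
    for t in range(1, n):
        x[:, t] = rho * x[:, t - 1] + e[:, t]
    return x

def actual_rejection_rate(rho, n=50, reps=20000, seed=0):
    """Actual Type I error rate of a nominal-5% two-sided t-test of mu = 0
    when the data are AR(1) with coefficient rho (true mean is 0 throughout)."""
    rng = np.random.default_rng(seed)
    x = simulate_ar1(rho, n, reps, rng)
    t = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))
    crit = 2.0096  # t quantile t_{0.975, 49}: the nominal 5% critical value
    return float(np.mean(np.abs(t) > crit))
```

With rho = 0 the actual rate sits near the nominal 0.05; with rho = 0.5 it is inflated several-fold: exactly the kind of nominal/actual discrepancy a misspecification test is meant to catch before it leads an inference astray.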

OK, I agree with pretty much all of what both of you wrote here. And it seems we agree that the problem is not so much with the statement “all models are wrong” itself, but with its use.

Actually I think that there is some good use for it, for example being precise about what can be achieved when rejecting or not rejecting models, even with severe tests (neighbourhoods of models can be confirmed, under assumptions, not exact models). It is a good reminder against taking models too literally.

“All models are wrong” is prone to misunderstandings, but saying that the statement is false can be misunderstood, too, unless a more detailed account is given about the role of models and in which way they can be “false” or “adequate”.

Christian: we do give a fuller account. Remember, too, that statements are true or false, so the truth value depends on what you say about the model, or models in general.

Very interesting, Aris, but as I am on holidays (merry Boxing Day to you) I have only a couple of brief comments. First, why not show some severity curves for the interesting cases where the inferences might be affected by the experimental design in a manner that leads to the severity approach yielding substantially different inferences from a likelihood-based method? Optional stopping, for example. That would be far more useful, in my opinion.

Second, you have misrepresented the likelihood-based conception of evidence by contrasting the probability assuming a hypothesis true with the probability assuming the hypothesis false. The likelihood approach deals with comparisons of hypotheses such that the evidence favours one over the other. It does not normally apply to a single hypothesis in the manner that you present. This is a very severe mistake.

Michael: I think I’ve had enough blogposts as of late regarding the failures of the comparativist simple likelihood approach. Not all “likelihoodists” follow the Royall style, but to the extent that they do, as I’ve argued, the account is bankrupt as an account of inference. (The others have different difficulties.)

Michael: I spent a lot of time working on SEV functions for optional stopping under the normal model with the sufficient statistic (x_bar, n), so I can assert with some confidence that no *unique* SEV exists in that problem.

Corey: What does that mean? I take it it means that the SEV assessment depends on when it stops as well as the nature of the stopping rule, but that’s what one would want because the possible outcomes and error probabilities differ.

Mayo: What it means is that the SEV function will depend on the details of the way the test procedure maps a given choice of alpha to the boundaries of the rejection region. Since there is no uniformly most powerful one-sided test, different test procedures can be envisioned that do that mapping in different ways, and there will be a SEV function corresponding to each choice. This is in contrast to the situation where n is fixed, in which case there is only one way to map a value of alpha to a (one-sided) rejection region.

Corey, that is a very important result, I think. Can you condition on the optional stopping at the same time as conditioning on the sample size at the stop? In my own attempt I found that if I condition on the sample size then the influence of the stopping rules drops out entirely and the severity curve tells exactly the same story as the likelihood function.

Of course, that would be expected in the circumstance where the likelihood function and severity curves both tell us how well the data support parameter values relative to each other. That is why I have been asking Mayo to give severity curves for situations where the error statistician might like to form inferences that are not supported by the likelihood function alone.
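For readers following this exchange, here is a minimal sketch of the fixed-n severity computation under discussion, for the textbook case of a one-sided test in a normal model with known sigma: the severity for the claim mu > mu1 is the probability of a sample mean no larger than the one observed, were mu equal to mu1. The function name and the numbers in the comment are my own illustration, not anyone’s published example.

```python
import math

def sev_greater(mu1, xbar, n, sigma=1.0):
    """Severity for the claim mu > mu1 after observing sample mean xbar
    in a fixed-n normal model with known sigma:
    SEV(mu > mu1) = P(Xbar <= xbar; mu = mu1) = Phi(sqrt(n)*(xbar - mu1)/sigma)."""
    z = math.sqrt(n) * (xbar - mu1) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# With xbar = 0.4 and n = 25: the claim mu > 0 passes with high severity,
# mu > 0.4 with severity exactly 0.5, and mu > 0.8 with very low severity.
```

Plotting sev_greater over a grid of mu1 values gives a severity curve; whether and when such curves diverge from the likelihood function is precisely what the optional-stopping discussion here is about.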

Michael: I am surprised you can ask this. Maybe start with the 4 posts on the Law of Likelihood. We always reach different inferences, because we compute error probabilities and they don’t. Selection effects, multiple testing, stopping rules, randomization: any uses of sampling distributions. Remember too that Royall can never appraise how poorly warranted a single hypothesis is, and can’t handle compound hypotheses.

Michael: I’ve found a model in which the SEV function is unique and gives a different result than the likelihood function. As with optional stopping, it involves altering the sample space while leaving the parameter space unchanged (relative to the fixed sample size normal model).

The trick: continue using the normal model with fixed sample size (wlog n = 1) — just remove a chunk of the sample space, say, the interval [-1.96, 1.96]. This induces a peculiar “truncation” term in the sampling distribution, but the proof of the existence of a UMP test still goes through because the monotone likelihood ratio property still holds (the truncation term cancels out of the ratio). Since SEV involves an integral over the sample space and likelihood/Bayes methods do not, this breaks the symmetry of the unmodified normal model.
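Corey’s construction can be sketched numerically (my own illustration of the setup he describes, not his actual computation). With the interval [-1.96, 1.96] removed from the sample space of a N(mu, 1) observation, the truncation factor depends on mu but not on x, so the likelihood ratio remains monotone in x, while the tail probabilities that enter SEV are rescaled by the retained mass:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def trunc_density(x, mu, a=-1.96, b=1.96):
    """Density of N(mu, 1) with the interval [a, b] removed from the sample space."""
    if a <= x <= b:
        return 0.0
    phi = math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)
    mass_kept = 1.0 - (Phi(b - mu) - Phi(a - mu))  # normalizer depends on mu, not x
    return phi / mass_kept

def tail_prob(c, mu, a=-1.96, b=1.96):
    """P(X >= c; mu) under the truncated model, for c > b."""
    mass_kept = 1.0 - (Phi(b - mu) - Phi(a - mu))
    return (1.0 - Phi(c - mu)) / mass_kept

# Monotone likelihood ratio still holds: f(x; 1)/f(x; 0) is increasing in x
xs = [2.0 + 0.1 * k for k in range(40)]
ratios = [trunc_density(x, 1.0) / trunc_density(x, 0.0) for x in xs]
assert all(r2 > r1 for r1, r2 in zip(ratios, ratios[1:]))
```

The assertion at the end confirms that the MLR property (and hence the UMP-test argument) survives the truncation, since the truncation factor is constant in x; meanwhile `tail_prob(2.5, 0.0)` is roughly twenty times the untruncated tail `1 - Phi(2.5)`, showing how the sample-space integrals behind SEV are altered.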

Michael: I deal with both of those issues in:

“Who Should Be Afraid of the Jeffreys-Lindley Paradox?”, Philosophy of Science, 80: 73-93, 2013.

Aris, I don’t see where either issue appears in that paper. It contains no severity curves, no optional stopping, and the misapplication of likelihood ratios to H and not-H does not occur in it.

Mayo seems to think I’m a pest, so I now give up.

Michael: Table 1 in that paper is a severity curve evaluated for a particular example to illustrate the large n problem. Optional stopping is just an example where one has to be careful how to define the relevant pre-data error probabilities to take into account the sequential nature of the testing involved. As such, it has nothing to do with the post-data severity evaluations.

Mayo:

Perhaps it would help to remember that when I (and many others) criticize null hypothesis significance testing, we are not criticizing error statistical methods. Indeed, I think I identified myself as an error statistician in my paper with Shalizi. Here’s how I see things:

1. A reformer (such as Jeremy Freese or myself) criticizes null hypothesis significance testing as it is performed in a real published paper (for example, the himmicanes and hurricanes study, or the air pollution in China study, or the ovulation and voting study).

2. You express concerns about the recycling of “howlers.”

Instead, why not try this:

2. Simply agree with us that null hypothesis significance testing is not error statistics as you would like to see it.

To put it another way: I (and others) are not picking on these foolish studies because we have a problem with error statistics. Rather, we have a problem with null hypothesis significance testing (the all-too-common procedure in which a researcher attempts to demonstrate the truth of his or her preferred theory A by rejecting null hypothesis B).

From reading your blog and your comments on my blog, I get the feeling that you think there’s a simple bipolarity, a tug of war between the error statisticians (including you) and the users of classical p-values on one side, and the reformers, critics, and Bayesians on the other side.

I suggest you spin things around a bit and recognize that you can be on the same side as the reformers (or, as you put it, the “reformers”). I and others have no problem with error statistics as used to criticize models that we are using. My problem is with null hypothesis significance testing (which you refer to as “so-called NHST” and other people simply refer to as NHST), not with probabilistic model checking.

Just for example, regarding Aris’s point #1 in the above slides, I completely agree that error-statistical tools can use background knowledge. See, for example, my recent paper with John Carlin in Perspectives on Psychological Science. Unfortunately, many many researchers (including the authors of the papers referred to in item 1 above of this comment) do not use background knowledge in their inferences, dooming them to scientific failure, over and over and over again. I’d be happy to say that these researchers are not truly using error statistics; rather, they are using some of the forms of error statistics (notably, strong hypotheses and p-values) in what Feynman memorably referred to as “cargo cult science.”

Again, I say to you: join the reformers on this! No need to reflexively take the side of anyone who uses a p-value. The true error statisticians are right here all along.

Andrew: I appreciate your interesting comment.

There’s a lot I’d want to say to it, but it’s already past midnight, so I’ll “just” say this (and ponder it again tomorrow)–oy it got long: Anyone who equates statistical significance testing or statistical hypotheses testing of either the Fisherian or N-P varieties with what you call “the all-too-common procedure in which a researcher attempts to demonstrate the truth of his or her preferred theory A by rejecting null hypothesis B” is being misleading. Going from statistical significance (let alone biased, fraudulent statistical significance) to demonstrating the truth of his or her preferred scientific theory is an ancient fallacy, perhaps exacerbated nowadays by high-powered, big-data dredging, with too little emphasis on experimental design.

I’m very glad you endorse error statistics, really, but the thing is—it stands for something. It concerns using probability to control and assess the capability of methods to detect mistaken interpretations of data (error probabilities). Statistical tests, and cognate methods, serve that goal. Other methods that further that goal inherit the error statistical foundation.

I get the feeling (sometimes) that it is YOU who think there’s a simple bipolarity, namely between

[1] The absolute worst abuses of tests that commit every fallacy in the book ALL AT ONCE coupled with being in areas of dubious scientific legitimacy with extremely weak or non-existent theory (going from nominal, not actual, statistical significance to substantive causal claims, through any amount of finagling, cherry-picking, barn hunting, bias, make believe proxies and all the rest, as in much social psychology), and

[2] being limited to a strict falsification of a statistical model M which then allows inferring not-M (I guess, I’m not sure what more is allowed in your probabilistic model checking).

I don’t think you really can mean this, but it’s hard to know how else to interpret your comment.

Even though I and any sensible statistical practitioners supplement N-P tests with estimation and severity estimates*, the idea that the theory and practice of statistical testing should be judged by examples that violate ALL the requirements of sound statistical inference is beyond silly—it’s irresponsible. This completely omits (among much else) such examples as testing causal claims based on randomized-controlled, blinded trials. These are based on significance tests. Were we to have ousted such tests, as some** “reformers” appear to recommend, we never would have found out that hormone replacement therapy doesn’t reduce risks of heart disease in women, but rather increases them, increases breast cancer risk, etc. (just one of many examples that I care about). They were so sure every woman should be given HRT that for years women were obstructed in their attempts to get the RCTs.

While programs like the Cochrane collaboration work to contribute to stringent, scientific studies, based on methods that control error probabilities, many “reformers” keep busy trying to tear down the inferential framework of testing (“reflexively” dumping on the use of p-values even in the Higgs particle analysis) based solely on people who use statistics for pseudoscientific window dressing, or, as with Lindley, simply to promote their rival statistical account. It’s hard to know which is the more problematic: those who fall for or promote the “chump effect” http://www.weeklystandard.com/articles/chump-effect_610143.html#

or those who equate all of frequentist statistical inference to instances of “chump effect” articles. There is one cottage industry publishing chump effects, and another publishing critiques of chump effects, perhaps offering a non-frequentist “cure”. All the cures I’ve seen show serious misunderstandings and (as Spanos might say) are likely to kill the patient!

Let’s think about this some more later–we should be on the same side.

*Along with background information based on biases, flaws and foibles: the choice is not between no background information and background in terms of prior probabilities.

**replace “your” with “some”

A longer reply, written after rereading “The Garden of Forking Paths” paper, though not well-proofed:

http://rejectedpostsofdmayo.com/2014/12/27/gelmans-error-statistical-critique-of-data-dependent-selections-that-vitiate-p-values-an-extended-comment/

I’m actually much harsher than Gelman when it comes to the studies he criticizes: they fall into my classification of questionable or pseudo-science. But I’m not sure why Gelman seems to place his own data-dependent modeling alongside some of the flawed studies he rightly criticizes. There are crucial distinctions. Well, anyway, my extended comment is on my Rejected Posts blog.

As someone who is neither statistician nor philosopher, let me add to the mix of problems the combination of canned statistical software packages with poor teaching of statistics to graduate students in science fields. Someone (maybe me) should conduct a survey of scientists asking about the basic philosophical foundations of the stats they are relying upon in their research. My perception is that it will be a mess. Yes, many misuse p-values and create nonsense results that are at best not replicable. And I have plenty of literature criticizing such practice. But the next cottage industry will be lampooning the nonsensical use of Bayesian stats by scientists who do not really know what a prior or posterior means to anyone else. I see this problem as potentially more serious because of the greater ambitions that go with these stats (the probability my hypothesis is true?).

John: You must be kidding to suggest “the next cottage industry will be lampooning the nonsensical use of Bayesian stats by scientists”. Anyone who tried that would be subjected to extraordinary rendition or, at minimum, exiled here on Elba (which is actually a pretty nice place). But seriously, “big Bayes success stories” are all the rage. They are happy to criticize within the family (see my deconstruction of Berger currently on my 3 yr memory lane post), but one gets the sense that any criticism, especially from a frequentist, is likely to evoke sensitivities ingrained from 40 or 50 years ago when subjective Bayesians were criticized.