I could have told them that the degree of accordance enabling the “6 principles” on p-values was unlikely to be replicated when it came to most of the “other approaches” with which some would supplement or replace significance tests– notably Bayesian updating, Bayes factors, or likelihood ratios (confidence intervals are dual to hypotheses tests). [My commentary is here.] So now they may be advising a “hold off” or “go slow” approach until some consilience is achieved. Is that it? I don’t know. I was tweeted an article about the background chatter taking place behind the scenes; I wasn’t one of people interviewed for this. Here are some excerpts, I may add more later after it has had time to sink in. (check back later)

**“Reaching for Best Practices in Statistics: Proceed with Caution Until a Balanced Critique Is In”**

J. Hossiason

“[A]ll of the other approaches*, as well as most statistical tools, may suffer from many of the same problems as the p-values do. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold (P3)? Does either measure the size of an effect (P5)?…How can we decide about the sample size needed for a clinical trial—however analyzed—if we do not set a specific bright-line decision rule? 95% confidence intervals or credence intervals…offer no protection against selection when only those that do not cover 0, are selected into the abstract (P4). (Benjamini, ASA commentary, pp. 3-4)

** What’s sauce for the goose is sauce for the gander right? **Many statisticians seconded George Cobb who urged “the board to set aside time at least once every year to consider the potential value of similar statements” to the recent ASA p-value report. Disappointingly, a preliminary survey of leaders in statistics, many from the original p-value group, aired striking disagreements on best and worst practices with respect to these other approaches. The Executive Board is contemplating a variety of recommendations, minimally, that practitioners move with caution until they can put forward at least a few agreed upon principles for interpreting and applying Bayesian inference methods. The words we heard ranged from “

**go slow**” to “

**moratorium**“ [emphasis mine]. Having been privy to some of the results of this survey, we at Stat Report Watch decided to contact some of the individuals involved.

* Background information via priors. *Donald Berry declared that “standard Bayesian data-analytic measures have the same fundamental limitation as p-values” for capturing background information, adding that: “subjective Bayesian approaches have some hope, but exhibiting a full likelihood function for non-quantifiable data [regarding context] may be difficult or impossible”.

Most of the practitioners in machine learning that we interviewed echoed the late Leo Breiman, that “incorporating domain knowledge into the structure of a statistical procedure is a much more immediate…approach than setting up the [full] Bayes’ machinery” (1997, p. 22).

In stark contrast, many Bayesians maintain, with Christian Robert, that “The importance of the prior distribution in a Bayesian statistical analysis is …that the use of a prior distribution is the best way to summarize the available information (or even the lack of information) about this parameter. Yet at the same time, oddly enough, instead of using prior probability distributions to introduce background beliefs into the analysis, members of the most popular Bayesian school these days, alternatively called default, reference, or O-Bayesianism, are at pains to show their priors are as *uninformative* as possible! Here, prior probability distributions arise from a variety of formal considerations that in some sense give greatest weight to the data. There is a panoply of variations, all with their own problems; the definitive review is Kass & Wasserman, 1996. So there’s scarce agreement here.

……….

** O-Bayesian, Subjective Bayesian, Pseudo-Bayesian. ** Jim Berger, a leader of the O-Bayesian movement, told us: “The (arguably correct) view that science should embrace subjective statistics falls on deaf ears; they come to statistics in large part because they wish it to provide objective validation of their science”. Using default priors not only short-circuit arduous elicitation, he thinks the use of conventional priors combats a new disease that appears to be reaching epic proportions, which he dubs “pseudo-Bayesian” subjectivism.

Berger describes this as “a very adhoc version of objective Bayes, including use of constant priors, vague proper priors, choosing priors to ‘span’ the range of the likelihood, and choosing priors with tuning parameters that are adjusted until the answer ‘looks nice’.”

Perhaps they can agree on a principle prohibiting pseudo-Bayes, but the funny thing is, no one we interviewed gave the same definition as Berger!

We were pointed to a paper by Donald Fraser 2011 where he says only the frequentist prior is really “objective”, but in that case, it “can be viewed as being probability itself not Bayesian.”

Stephen Senn told us he’s irked to often find authors making claims such as ” ‘an advantage of the Bayesian approach is that the uncertainty in all parameter estimates is taken into account’” while in practice, he finds none of the priors reflect what the authors believe, but are instead various reference or default priors. “… it is hard to see how prior distributions that do not incorporate what one believes can be adequate for the purpose of reflecting certainty and uncertainty.”

Andrew Gelman was frank with us: “If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.”

Stephen Senn also had a humorous twist: “As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’.”

……………

** A Bayesian wants everyone else to be non-Bayesian.** That’s a section straight from Gelman (who says he

*is*a Bayesian). “Bayesian inference proceeds by taking the likelihoods from different data sources and then combining them with a prior distribution (or, more generally, a hierarchical model). The likelihood is key. . . . No funny stuff, no posterior distributions, just the likelihood. . . . I don’t want everybody coming to me with their posterior distribution—I’d just have to divide away their prior distributions before getting to my own analysis. Sort of like a trial, where the judge wants to hear what everybody saw—not their individual inferences, but their raw data.”

We’d heard Gelman wasn’t a friend of the standard subjective Bayesian approach, but we were surprised to hear him say he thinks it has “has harmed statistics” by encouraging the view that Bayesian statistics has to do with updating prior beliefs by Bayes’s Theorem. *Sure, but wouldn’t that be correct*? The consequence of this that really bothers Gelman is that it’s made Bayesians reluctant to test their models. “To them, any Bayesian model necessarily represented a subjective prior distribution and as such could never be tested. The idea of testing and p-values were held to be counter to the Bayesian philosophy.”

He directed us to a co-author, Cosma Shalizi, a statistician at CMU, who echoed Gelman in flatly declaring “most of the standard philosophy of Bayes is wrong”.

Gelman even shows sympathy for frequentists: “Frequentists just took subjective Bayesians at their word and quite naturally concluded that Bayesians had achieved the goal of coherence only by abandoning scientific objectivity. Every time a prominent Bayesian published an article on the unsoundness of p-values, this became confirming evidence of the hypothesis that Bayesian inference operated in a subjective zone bounded by the prior distribution.”

Possibly, the recent ASA p-value forum would fall under that rubric?

….

** Problems of Interpretation**. An important goal of the p-value principles was to highlight misinterpretations. What principles do the insiders put forward to guide Bayesian interpretation? Larry Wasserman thinks there are mistakes equally on both sides.

Although a common “complaint is the people naturally interpret p-values as posterior probabilities so we should use posterior probabilities. But that argument falls apart because we can easily make the reverse argument. Teach someone Bayesian methods and then ask them the following question: how often does your 95 percent Bayesian interval contain the true value? Inevitably they say: 95 percent. … In other words: people naturally interpret frequentist statements in a Bayesian way but they also naturally interpret Bayesian statements in a frequentist way.” If you ask people if it means 95 percent of the time the method gets it right, they will say yes, even though the frequentist coverage claim might no longer hold.

Each of the statisticians we talked to had a different answer to the question, “what’s the meaning or function of a prior probability?” Answers ranged from “it’s how much you’d bet”, “a hunch or guess”, “a way to add background beliefs to the analysis”, “an invariant reference to avoid bringing beliefs into the analysis”, “a way to regularize messy data”, “a way to get posteriors that match frequentist error probabilities”, to “an undefined concept whose sole use is to compute a posterior probability“.

…..

* Bayes hacking and transparency. *An important principle (4) in the ASA document on P-values emphasized the biases that arise in “conducting multiple analyses of the data and reporting only those” having results the researcher finds desirable. At least 2 of the people we interviewed expressed concern that the same selection effects could enable finagling Bayes factors and Bayes updating. The prior could well be changed based on the data, and the hypothesis selected to compare with H can be deliberately chosen to be less (or more) likely than H. Nevertheless, less than half the practitioners canvassed thought that a principle on “full reporting of selection effects”(4) could be justified for Bayes factors, likelihood ratios, and Bayes updating. Those who did, recommended an adjunct “error control” requirement for Bayesian inference.

Larry Wasserman put it bluntly: “If the Bayes’ estimator has good frequency-error probabilities, then we might as well use the frequentist method. If it has bad frequency behavior then we shouldn’t use it. ”

Those who didn’t, explained that the result would still be sound, because of “calibration” and “given what you believed”.

For instance, Jim Berger, even though he worries about pseudo Bayes told us that his favorite Bayesian conditional error probabilities “do not depend on the stopping rule” even if it allows stopping only when a credible erroneously excludes 0.

In his famous book, *The Likelihood Principle (*1988), jointly with Wolpert, even though the stopping rule ensures that the Bayes credible interval will erroneously exclude 0, they claim any “‘misleading’, however, is solely from a frequentist viewpoint, and will not be of concern” to a Bayesian, given what he believed to begin with (p.81). Irrelevance of the stopping rule and selection effects follow from the Likelihood Principle, which follows from inference by Bayes Theorem, except that it too is violated in default or reference Bayesianism.

…..

* Coherence. *One would think that all Bayesians agree on at least one principle:

*coherence*, in the sense of betting. But, no.

“Betting incoherency thus seems to be too strong a condition to apply to communication of information,” says Jim Berger. Even subjective Bayesianism is not coherent in practice, he says, “except for trivial versions such as always estimate θ ∈ (0, ∞) by 17.35426 (a coherent rule, but not one particularly attractive in practice)…In practice, subjective Bayesians will virtually always experience what could be called practical marginalization paradoxes” where posteriors don’t even sum to 1. But then it cannot be a posterior probability.

…….

** Bayes Factors. **Most Bayesians appear to think the Bayes factor (essentially, the likelihood ratio of two hypotheses or models), is the cat’s meow. But Jose Bernardo, a leader in the “reference Bayesian” movement, avers that “Bayes factors have no direct foundational meaning to a Bayesian: only posterior probabilities have a proper Bayesian interpretation.” While a main contrast between frequentist tests and Bayes factors occurs with the Bayesian assignment of a “spiked” prior to a point null, Bernardo opposes this distinction with confidence intervals, with their smooth reference priors.

Many people we spoke to tout popular default Bayes factor programs because they enable support for a null hypothesis, unlike in a significance test. However,Uri __Simonsohn__ told us that it is “prejudiced against small effects.”

He explained to us that it’s actually incorrect to claim support for the null hypothesis; it’s always just a comparison.“What they actually ought to write is ‘t*he data support the null more than they support one mathematically elegant alternative hypothesis I compared it to’.”* The default Bayes factor test “means the Bayesian test ends up asking: ‘*is the effect zero, or is it biggish*?’ When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero” with high probability!

“Saying a Bayesian test “supports the null” in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.”

At this point, Gelman reminds Simonsohn that “*You can spend your entire life doing Bayesian inference without ever computing these Bayesian Factors”.*

…………..

What’s the forecast for this next step in regulating statistical meaning? The ASA p-value document took over a year; they envision this next step will take at least twice as long. Given the important community service this would provide for those seeking to understand the complexities of the popular Bayesian measures of evidence, it’s well worth it. In the mean time officials are likely to advise practitioners to check any Bayesian answers against the frequentist solution for the problem and perhaps employ several different Bayesian approaches before putting trust in it. [To read the rest of the article see this link.]

*From the ASA P-value document.

****************************************************************************

I notice they didn’t canvass anyone on Gelman’s Bayesian data analysis. Gelman and Shalizi‘s joint article (2013) suggests that “implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo (1996), despite the latter’s frequentist orientation. Indeed, crucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense”. So maybe the entire exploration will take us full circle and bring us back to error statistics, but a far more sophisticated and inclusive version. That’s all to the good, even if this new “hold off until we get back to you” creates a degree of confusion in the mean time.[1]

[1] This post was written jointly with Jean Miller.

**To understand the meaning of this post, check the date it was posted.**

Would today’s “Statistical Establishment” really entertain a balanced critique of Bayes ratios, Bayes updating, likelihood ratio and “other” methods? No caveats attended the implicit permission to “supplement or replace” error statistical hypotheses tests with a melange of methods so amorphous that it’s not even a clear that ANY best or worst practices would be forthcoming. And it’s only the more precautionary forces that blocked a more cavalier attitude toward accepting different schools against frequentist significance tests. I do not say that anything remotely so clear-cut was involved–not at all! My impression is that contrasting logics, philosophies, and contexts weren’t directly considered. I have no doubt they were trying their best; the mix of statistical, domain-specific, historical, conceptual, political, and philosophical differences would first have to be recognized. I feel grateful for having had the chance to comment. But you never know, maybe in the future….

Hence the appropriateness of the date of this post.

Mayo:

As I wrote in my official comment on that official statement, I think the problem is not so much with p-values (though they do have problems, most notably in their difficulty of interpretation in the setting where the null hypothesis is not true) but rather with the notorious NHST. I do think that if you try to do NHST with Bayes factors or likelihood ratios or confidence intervals or posterior intervals or whatever, all those NHST problems will still be there.

Andrew: What would it mean to do NHST with Bayes factors? Do you mean you might use the data to locate a hypothesis, call it H’, that makes the data probable, so that the Bayes factor in favor of H’ against the null is impressively high? I’d never gotten around to asking you this question.

Of course, I must confess that the idea that the officials are likely to advise practitioners to “go slow” until a balanced critique of the other approaches is available, or “are likely to advise practitioners to check any Bayesian answers against the frequentist solution for the problem” strikes me as so far-fetched that this post, I thought, would be appropriate only for the date on which I did post it.

“Would today’s “Statistical Establishment” really entertain a balanced critique of […] and “other” methods?” No. It is pretty clear that there are serious mistakes in the foundations of conventional accounts of many of the methods, but there is absolutely no acceptance of that fact. Every idea is assessed according to preconceived and pre-accepted god-given truths of statistics and so no repairs to the foundations are possible and no connections between methods are allowed.

For example, my attempts to show how P-values and likelihood functions are related met with stiff opposition from four journals (version 2 is arXived http://arxiv.org/abs/1311.0081) with equally ridiculous and or mistaken comments from the many referees involve, but very little commonality in their criticisms. Similarly my attempts to show how the likelihood principle is over-applied and misinterpreted to the extent of licensing the silly comparisons of likelihoods from different statistical models has been skewered by referees who clearly disagreed strongly with each other! arXived here: http://arxiv.org/abs/1507.08394

Everyone who reads this blog must have a sufficient interest in the problem that he or she should do their best to see the the cracks and to allow repairs to be made, but even here we seem to have trouble finding common ground because of entrenched opinion and because of the miscommunication of ideas using words.

Michael: “Every idea is assessed according to preconceived and pre-accepted god-given truths of statistics and so no repairs to the foundations are possible and no connections between methods are allowed.”

So do you think that’s what happened, at least to some degree, at the P-value pow wow? In any event I’m sympathetic to your gripe, and hope you’ll make it known to the powers that be.

My attachments have begun to show only flickers and black-outs for the past 4 days, but I don’t know how to fix it (anyone see this before?) and have to leave for NYC in 2 days. So I can’t read these archival papers just now. I would think the relationship between p-values and likelihoods would be largely a technical matter, so I’m not sure why it would be a matter of foundational principle or philosophy. (My NYC computer should be OK, for the few days I’m there).

Again I concur with your complaint that people are often stuck in a conception that you may be trying to dislodge. After all, I’m here in exile on a Saturday night. For just one of dozens of exs., I’ve had to say, over and over and over again, that I was providing a different construal than, say, Neyman, for anyone to not merely repeat their conception of Neyman’s tests. I had to first confess all the extreme behavioristic views that one might take away from him, for example. And I’ve been at this a long time.

One obstacle is cross-disciplinarity. We assume it will be much easier than it is. I don’t know which is more difficult, from philosophy to stat (or econ or psych) or from stat (or econ or psych) to philosophy. I spent years on the Birnbaum article before publishing it in Stat Science.

One thing on the likelihoods, and I know this has come up before, the problem persists even with the “same stat model”. The rule in philosophical criticism, however, is that if a supporter of view V is making a a broad claim, you first criticize the broad claim. If likelihoodists (and the LP) hadn’t made such a broad claim,the literature on “counterexamples of likelihood” would have been restricted. Then you allow that even if it were restricted, the problem or a similar problem still holds. The fact is that other leading likelihoodists make the broad claim. For the restricted claim, I recently saw the ex. I wanted to mention to you: Cox and Hinkley p. 51 Ex 2.41.

Finally, there are conflicts of interest, professional and political. If so and so has put all his apples into a book on confidence intervals as the universal panacea, for example, then he can’t budge, even if you show him the duality between tests and CIs. (He might not even cite you if he got ideas from you.) Or if your job as an expert witness demands you ignore error probs, then that’s it. About to lose power, so I’ll stop—lightening on the island.

There is no problem in interpreting p-values when the null hypothesis is not true which is always the case. A p-value is simply one measure of the degree of approximation of the model specified by the null hypothesis to the data. There are of course other measures of approximation which may be more appropriate depending on the circumstances.

Laurie: Not sure about its being a measure of degree of approximation as opposed to, say, an indication of discrepancy from Ho. Even aside from the issue discussed at length in recent blogs as to what’s in the null, I don’t understand degree of approximation here. In a one-sided test of a Normal mean, say (mu ≤ 0 vs mu > 0), an observed result of 0 would get p-value of .5?

Deborah, you have not provided enough information, no sigma. Take the models to be N(mu,1) with mu<=0. Choose an alpha to quantify 'typical'; the larger alpha the weaker the concept of approximation. Given this a typical value X under N(mu,1) satisfies X <=mu+qnorm(alpha). Thus the data x will 'look like' a typical X under N(mu,1) if x =mu>=x-qnorm(alpha)}. If x > qnorm(alpha) the approximation region is empty. The p-value is 1-alpha^* where alpha^* is the smallest value of alpha such that the alpha-approximation region is non-empty. That is alpha^* determines the how weak the concept of approximation has to be in order for there to be an adequate model for the data x. In this case qnorm(alpha^*)=x so that the p-value is 1-pnorm(x). This example is however too simple to be interesting. For a sample of size n one would define ‘look like’ in terms of say the behaviour of the mean, the behaviour of the variance, the lack of outliers and the distance of the empirical distribution from the model distribution as measured say by the Kolmogorov metric.

Laurie: Sorry, I’m not sure about how to read your notation, I was on the outskirts of that earlier discussion. But now I think the main issue is whether the typical significance test is testing the approx of the model, and so we’re back to that earlier issue after all.

This strikes me as missing the point, just like NHST misses the point. All of this about priors and cutoffs and objectivity is arcane trivia that allows ignoring the main issue. The underlying problem is that, for most use cases, rejection of the default null hypothesis provides little to no information about the validity of the research hypothesis.

Until statisticians recognize the primacy of this issue and stop teaching people to test the default null hypothesis, statistical tests (of any form) will continue to hinder scientific progress by misleading people who want to assess some research hypothesis, or wasting the time of those who do understand but have to deal with their confused colleagues.

Anon: As I’ve always said, something some call NHST, invented I guess for people seeking an abusive form of tests that supposedly goes from statistical effects to research claims, goes against what both Fisher and NP explicitly said See my commentary on p-values.

https://errorstatistics.com/2016/03/07/dont-throw-out-the-error-control-baby-with-the-bad-statistics-bathwater/

Worse, many of those fields are pseudosciences a they are presently practiced, unless the experiments and measurements performed are picking up on the intended phenomenon. No statistics can help them!

However,with genuine sciences, finding a real effect, triangulating with known measuring tools and theories is often a crucial first step. Assessing extents of discrepancies is another. The same piecemeal probes, together with substantive knowledge (theoretical and instrumental) form the basis for scientific inference and discovery from prions to gravity waves. In fields that don’t use formal statistics (ie.., most fields), the analagous testing reasoning and arguing from error applies.

The issue here is more nearly the one I raised in “whipping boys and witch-hunters”: https://errorstatistics.com/2011/09/26/whipping-boys-and-witch-hunters-comments-are-now-open/

except that much of the scapegoating, at that time (a mere few years ago), was limited to questionable sciences and wild abuses of tests. The data-dredging and cherry picking known as no no’s in the past grew to industrial proportions (as Taleb says someplace), and cavalier practitioners in the social sciences actually said it was allowed. A number of outspoken “reformers” set out their shingles, despite their never having understood tests, and who still define power incorrectly, and became famous p-bashers.

Mean time, subjective Bayesianism is ousted in favor of “default” Bayesian methods, thanks to J. Berger’s warnings that scientists would never stand for this subjective stuff (2003, 2006). Default priors–& there are many competing systems to choose from– are just undefined whatzits that allow a posterior to pop out. Bayesianism is PC, and even retains a bit of the patina of philosophical depth of subjectivism of old.

Thanks to some bad behavior in medical research the old scapegoating takes on a new seriousness. Diagnostic screening models are evoked to estimate the “positive predictive value” of your study by viewing it as having been selected from a huge industrial urn of journals and hypotheses, assumed to contain a high prevalence, at least 50%, of spurious effects and QRPs.

The heavy weaponry is brought out to fight an ESP buffs (who dared to publish in a regular journal,) the social primers (who sometimes used no data at all) and now, also, the non-replicating medical researchers. The scapegoat is, of course, significance tests and their abuse. The new fashions are, among others: replicating non-replication, technical “reforms” (some quite welcome, eg., preregistration), fraud-busting and statistical forensics (also welcome*), and technical activism institutes (built on a given statistical philosophy but purporting to only be urging sound scientific principle). The same questionable sciences remain, and are given largely unvetted “other methods” that may well bury the bad behavior, or just present you with a slew of priors, models, and likelihoods and invite you to choose for yourself, all under the banner of “transparency”.

*Note the fraud-busting and forensics employ frequentist tests and analyses, sometimes straight from Fisher.

>”The scapegoat is, of course, significance tests and their abuse.”

Yes, for good reason. If you read the papers, take the stats classes, and talk to these people you will find they are taught to “test the null hypothesis” and then go on to misinterpret the results because they believe it to be an important activity. Meanwhile, coming up with a quantitative model that would be worthwhile to test is discouraged. Reviewers, etc tell people put that (which requires much more work and is much more useful) into an appendix.

I don’t know if it is possible, but I recommend you sit in on some psych/sociology/medical journal clubs at your university and observe what is going on. It is ridiculous that I need to be apprehensive of going to the doctor because they make decisions based off a literature filled with misinterpreted p-values, so I really hope this issue can be cleared up.

At least in the corner of astrophysics I used to work in, the Bayesian methods for both parameter estimation and model comparison included priors that actually encode important model information. Have LIGO researchers or astronomers in general given up on alternative methods? No. But what is used is decided on a case by case basis according to a number of criteria. So in the end the community has already moved on from this seemingly fruitless discussion. Well I think it has.

Particle physicists are another story…

MattW: How does it encode important model information? Or rather, what is the information?

The parameter estimation [PE] paper (http://arxiv.org/abs/1602.03840) for GW150914 includes a short discussion of the motivation of each of the priors used in the MCMC & Nested Sampling analyses of the signal.

One example is that sources are presumed to be uniformly distributed in volume, which at distances of 100s of Megaparsecs is pretty reasonable given our understanding of the density of galaxies. This means that the number of sources grows as a cube of the distance, but sadly they are also quieter in terms of signal strength. Another is that sources are presumed to be uniformly distributed across the sky and again this goes back to our understanding about the distribution of source galaxies in the universe. It also makes sense given that the sensitivity of these interferometers is more akin to an antenna than a directional telescope.

So yeah, there are priors (like coalescense time) that constructed partly for ease of calculation, but that is definitely not the primary driver in every case.

MattW: Will look at. so the priors have a basis in frequencies.

Mayo: The sampled and therefore analyzed galaxies are a discrete set on their own, but that in and of itself doesn’t seem too meaningful. One then can infer the presumably continuous density from that discrete set. You’d need a prior on the distribution as a function of distance, so maybe at that point you’d construct an uninformative prior in an effort to minimize the bias based on assuming too strong reliance on theory. But at this point, assuming uniform in co-moving volume is a very safe assumption.

So to your statement, ummm kinda… but not completely.

I’d just say it was empirically based.

MattW: In the end though, I take it you want to make astronomical inferences, or arrive at models that capture the phenomenon of interest. And you want to do so with methods rather good at revealing if the models err in important way–ways that could misdirect us. This is not to assign a posterior probability to the model, except in informal Enlgish, where we mean something like, we can infer this is approximately what’s going on. The form would depend on the problem. You’re not betting on the model (at least that’s not what statistics does for you) but you think such and such portions or claims are warranted, others less well probed, etc.and you’ve a good indication as to what questions remain open.

Mayo: The search problem is handled by another analysis suite that follows the rubric of the Neyman-Pearson lemma. What I am describing is the parameter estimation problem for a fixed class of models (signal waveforms). The objective is to determine the combined posterior of model parameters assuming there is a signal in that small chunk of time and General Relativity [GR] is True. While there are tests to bound the results for expected events as well as noise and non-GR signals, the point from the paper linked above is NOT to disprove a class of models. It’s to find the best fit parameters and their associated credible intervals.

While I am focusing on this one paper, I know this same Bayesian methodology is used across physics’s myriad of sub-disciplines with impressive results. http://arxiv.org/find/grp_physics/1/all:+bayesian/0/1/0/past/0/1 You are starting to see even some particle physics project start to utilize it.

Deborah, notation qnorm, pnorm are standard R sometimes written as Phi^-1 and Phi respectively if that is your problem. I cannot think of any other way of interpreting a P-value. Look at the confusion in

http://andrewgelman.com/2016/01/08/why-inversion-of-hypothesis-tests-is-not-a-general-procedure-for-creating-uncertainty-intervals/

and

http://andrewgelman.com/2011/08/25/.

You write (I have commented on this before)

(1) Pr(Test T produces d(X)>d(x); H0) ≤ p

where x the data is held constant so to speak and X the random variable is distributed according to the distribution defined by H0. A very small P-value simply means that the distribution of X is a very poor approximation for x as measured by the statistic d. That is d(x) does not look like a typical values of d(X), in this case it is much larger. This can of course be made precise. What else could it mean?

Your idea of error probabilities and severe testing is something I agree with. However the severe testing has to be extended to the choice of model whereas your version only operates within an already chosen model. Severe testing of the model is necessary so to speak but not sufficient. Severity depends on the chosen model and consequently can be imported, Tukey calls this a free lunch. You have to regularize and this should be made explicit.

Laurie, is this what you mean by “regularization”?

“In simple terms, regularization is tuning or selecting the preferred level of model complexity so your models are better at predicting (generalizing). If you don’t do this your models may be too complex and overfit or too simple and underfit, either way giving poor predictions.” [From Toby Kelsey’s answer here: http://stats.stackexchange.com/questions/4961/what-is-regularization-in-plain-english%5D

If so, then I don’t immediately see how it could be necessary to “regularise” if the model has only one or two parameters when there are many data points. Most of the models used in Mayo’s severity work are trivially simple and so I do not see how regularisation is required.

Michael: I use exactly the examples employed in the major discussions of foundations for years on end. I agree with Berger and other “reformers” that if we can’t understand the foundations of the simplest examples (which happen to reflect general but rival logic of statistical inference), then we’ll surely fail in adjudicating more complex cases. The issues of reliable prediction and replication arise even in these simplest cases, as well as the most complex ones.

Michael – as soon as you write down a model with one or two parameters you’ve immediately made arbitrarily many assumptions about what to neglect. On the other hand this often seems like a good idea.

How do we justify which model to write down and then ‘work within’? From a ‘foundations of inference’ point of view how should we understand this? Overfitting vs underfitting, bias-variance tradeoffs, max ent, ‘hornless’ models? Etc.

Although Laurie often mentions regularisation the nicest thing (in my opinion/understanding) about Laurie’s approach is that it is often much more based on the type of mathematics and reasoning developed in other areas for extracting key qualitative (and hence robust) concepts from messy quantitative details.

Consider how a mathematician analyzes a simple chaotic discrete map. They study the topological properties and stability of the phase portrait, not individual trajectories. Each individual is irregular but you can still ask ‘what are the invariant sets to which trajectories are attracted and stay in once there?’. These have much more robust answers and emphasise the qualitative features of interest at the expense of individual trajectories. If you must choose an individual trajectory to follow then you need to ensure it is an ‘honest representative’ of other equally a priori possible trajectories.

The difference or differential equation loses primacy and gives way to the phase portrait as an independent object, or as an object characterising a set of models giving topologically equivalent results.

Emphasising the ‘weak topology of exploratory data analysis’ as Laurie does is to me basically this same sort of process. The likelihood or specific generating mechanism etc gives way to the qualitative stable features of interest (you might prefer to imagine them as functionals of the likelihood function for example. Not sure how far this extends) and the class of models that are topologically equivalent (all within adequate neighborhoods) with respect to these features.

omaclaren, I have only the vaguest notion of what you are writing about. It seems to me that Laurie’s (and now yours?) concerns about Mayo’s model represent a transference of concern about his own complicated system onto the simple system that Mayo is talking about. An unnecessary complication perhaps. Certainly it seems to me to be a distraction from the issues at hand.

Michael, the reference you gave does give some examples of regularization, ridge regression and lasso both of which contain a penalty term and can be seen as examples of Tikhonov regularization. Minimizing the number of parameters is also a form of regularization which I suppose can be seen as Tikhonov regularization with an L_0 penalty term. However there are other forms of regularization not of this form, for example minimizing the number of peaks in non-parametric regression subject to multi-scale constraints.

The main example not of either form is the location-scale problem. The model is F((*-mu)/sigma) where you can now choose F. Once F is given there are only two parameters mu and sigma but the choice of F is free in which case there are many parameters. Of course you can choose F to mimic the data, a kernel estimator for the density with a small bandwidth. You remember the comb distribution. This does not mimic the data, it is very close to the normal distribution so you have a problem without mimicking the data. If you use maximum likelihood then the variance of the location estimator is inversely proportional to the Fisher information and the larger this is the more precise the estimate and consequently the more severe the testing. This precision or severity is imported from the model and unless one has very good reasons for this one should not import precision through the choice of model. This leads to minimum Fisher regularization and for example to the normal model, but there are others. This leads to the highest precision under the most difficult circumstances: it is difficult to estimate the mean of Gaussian data, the best one can do is to use the mean. In practical terms this is of little importance because people use the Gaussian model as a form of default model. Nevertheless one should understand what is going on and why the Gaussian model is in general a good choice.

My own preference is not to regularize the model but to regularize the procedure. Thus in the location-scale case I prefer M-functionals which allow one to analyse data with outliers. To argue in favour of this would take up too much space.

More important and of practical importance is the choice of model and, in the case of a parametric family, the specification of those parameter values which are consistent with the data. There seems to be a great reluctance to do this. I suspect that the reason is that doing so would would greatly reduce the importance of likelihood.

Deborah, well yes, that is the view from within where within may be loosely described as ‘behave as if true’ in conjunction with Bayes, de Finetti, Fisher, Neyman, Pearson and likelihood. Looked at from without it as I do it all looks pretty strange. You write

Error (Probability) Statistics

What is the key on the statistics side: the probabilities refer to the distribution of the statistic d(x) (sampling distribution) – applied to events.

In your notation x is the data. You do not have the remotest chance of knowing the distribution of d(x). All you can do is model it. The model will not be the truth, it is at best an approximation. Why not accept it as such and do so consistently?

Actually it is clear why this is not accepted. No true models means no true parameter values, only adequate one, approximation cannot be represented by a Bayesian prior over the parameter space, a confidence region cannot be defined by its covering probability of the true parameter value, no probability of a hypothesis being true, approximate the data as given (know apparently as the weak likelihood principle) although the definition of approximation may make no mention of likelihood etc.

In my opinion what is needed is a paradigm change, from truth to approximation, and this applies to the simplest cases. Paradigm changes take a long time but the ASA report on p-values indicates to me at least a paradigm in crisis. Are there any philosophical arguments against the use of the concept of approximation in statistics?

The one new innovation in statistics in the last few years is machine learning. It does not fit into the

standard paradigm as it not restricted to parametric families. Indeed I know of no case where a standard parametric model is used but there may be such cases. A probability model here is a single measure P which is the same way I use the word. My potted history is that it was ignored instead of being welcomed by the Bayesian-Fisher-Neyman complex because it simply did not fit into their paradigm. It was however taken up by the computer science people which is where it is now situated, a loss for the statistics community but their own fault.

Laurie, I still don’t really get it. And I have to admit that I do not understand your point in your first paragraph at all: the article may deal with complicated systems, but Mayo’s system is simple. I do have some responses to snippets of your writing that may make my misunderstandings clear:

“The model is F((*-mu)/sigma) where you can now choose F.”

Response: Mayo chose a Gaussian model with known sigma. A Gaussian model is reasonable in a great many circumstances (at least when allowing unknown sigma). Examples from my own experimental experience include the blood pressures of rabbits, the diameters of mesenteric arteries, the changes in diameter of arterioles upon pharmacological challenge, the logarithm of concentrations of numerous blood-borne mediators, the logarithm of the EC50s of agonist concentration-response curves, and many more.

“Of course you can choose F to mimic the data […] You remember the comb distribution.”

Response: Yes, you could, but I would never advise anyone to choose the model on the basis of a small dataset. My observation above of the reasonableness of a Gaussian model is based on the aggregate experience of many datasets. (Note that I do not claim that the data are actually normally distributed —none of them are strictly normal— just that the normal distribution is close enough.) Yes, I remember the comb, but I do not recall that you presented a case that such a distribution was ever a reasonable choice for analysis. It is pathological in ways that are obvious and so it is not a good choice.

“If you use maximum likelihood then the variance of the location estimator is inversely proportional to the Fisher information and the larger this is the more precise the estimate and consequently the more severe the testing.”

Translation: The steeper the likelihood function near the peak (and thus the steeper the slope of the severity curve), the more precise the estimate. Yes, I agree. I’m not sure that we need to talk about Fisher information to make the point, however, as it is obvious from the shape of the likelihood and severity curves.

“This leads to minimum Fisher regularization and for example to the normal model, but there are others. This leads to the highest precision under the most difficult circumstances: it is difficult to estimate the mean of Gaussian data, the best one can do is to use the mean.”

Response: What? Is a Gaussian model a “difficult circumstance”? To get a better estimate of the Gaussian distribution mean than the sample mean is to introduce additional information. It would need to be information from outside the dataset, presumably.

“In practical terms this is of little importance because people use the Gaussian model as a form of default model. Nevertheless one should understand what is going on and why the Gaussian model is in general a good choice.”

Response: OK, so you are raising a storm in a teacup? Do I need to understand regularization that is not needed for my models in order to understand that my models are a good choice? I don’t think so. Do I need to understand regularization to understand that a comb distribution, or any distribution with ambiguous and redundant parameters is a poor choice for a likelihood analysis or regression? Apparently not.

“More important and of practical importance is the choice of model and, in the case of a parametric family, the specification of those parameter values which are consistent with the data. There seems to be a great reluctance to do this. I suspect that the reason is that doing so would would greatly reduce the importance of likelihood.”

Response: I really doubt that the reluctance to examine in depth the choice of parametric family has anything at all to do with likelihood! Many statisticians have only the most vague notion of the centrality of likelihood to statistically supported inferences, and most would deny that a likelihood function has any direct utility. As for scientists who use statistics in their everyday analyses, few of them would even be able to say how likelihood is anything other than a synonym for probability. More likely the reluctance comes from a belief, valid or not, that the ordinary choices of parametric family are good enough.

Consider a model set up of the form

p(y|a,b)

= int{ p(y|x)p(x|a,b) }dx

This gives a map from (a,b) values to distributions over y.

Likelihoodists and Bayesians fix a,b as ‘background info’ and work ‘within’ this setup to draw conclusions about x given some observed data y0.

Eg x might represent a parameter vector of location and scale parameters, a and b represent assuming a Gaussian model form.

Laurie asks effectively – how would your conclusions about x be affected by choosing different ‘a’ and ‘b’ values but still keeping the location and scale parameters as primary. Ie he asks an ‘external question’ wrt the above setup.

The typical Likelihoodist or Bayesian responds by saying – if you want to compare models with different ‘a’ values (say) then you need to start from a model of the form

p(y|b)

= int{int{ p(y|x,a)p(x,a|b) }da}dx

Ie the variable background ‘a’ variables are now treated as nuisance parameters and treated ‘within’ an enlarged model.

If you assume that we still have a location-scale model with parameters of interest x, then use

int{int{ p(y|x,a)p(x,a|b) }dx}da

=

int{int{ p(y|x)p(x,a|b) }dx}da

Ie assume ‘x’ are ‘predictively sufficient’ regardless of ‘a’. This gives

int{int{ [p(y|x,a)-p(y|x)]p(x,a|b) }dxda

= 0

This is simply an orthogonlity condition. All location-scale models of this type are thus equivalent up to a component orthogonal to p(x,a|b) = p(x|a,b)p(a|b).

Unfortunately it is clear from the above that your conclusions about x will depend on ‘a’ in general. If you want to choose a specific model – ie fix a specific ‘a’ value – for which x has least dependence on this ‘a’ choice then you could regularise by eg choosing a model with least Fisher info leading to the Gaussian model.

So – Laurie asks an ‘outside’ question and the Likelihoodist or Bayesian responds by converting it to an ‘inside’ question relative to an enlarged model.

From experience/mathematics we know that reducing this large model to a Gaussian is often good because many models are well approximated by the same Gaussian and so for these models our conclusions shouldn’t depend on their differences.

This seems an acceptable response to me and is how I interpret Gelman’s philosophy.

Alternatively though you can attempt to address these in a new way. Laurie’s approach and, as he mentions, the machine learning folk, are developing new approaches starting from a different perspective. In a sense it is a way of trying to formulate the above process itself.

(Note that the choice of minimum Fisher info is again – I believe – an orthogonality assumption.

This paper of Cox and Reid (1987) ‘Parameter orthogonality and approximate conditional inference’ (PDF: http://yaroslavvb.com/papers/cox-parameter.pdf) might be of interest but I haven’t read it beyond a skim.

The general idea seems to follow from first principles fairly straightforwardly but there is probably a whole literature of subtleties that I haven’t read).

Hi Laurie, I have a side question for you.

You say a model is for you a single measure. Why not equate a model with a single probability space/triple and include also the sample space and collection of events?

I find this makes some things less ambiguous. We do the same thing with functions (domain and codomain are included) and eg differential operators (BC included). I get the impression that some (but only some) debates implicitly involve the other model ingredients and not just the measure.

Michael, here is Tukey (1993) on the same topic, maybe you find him clearer. The paper was referred to by David Brillinger in his Tukey obituary in the Annals of Statistics.

\section{B is for Blandness}

Davies emphasizes the pathological discontinuity of estimate behavior as a function of assumed model. In any neighborhood of a simple model (location, or location- and-scale, say) there are arbitrarily many (similarly simple) models with arbitrarily precise potentialities for estimation – – potentialities that can only be realized by very different procedures that [sic] those that realize the potentialities of the original simple model.

There are always nearby models that “describe the data” – – and are arbitrarily easier to estimate from. There may or may not be models that “describe the data” and are harder to estimate from.

A slightly pessimistic (or cynical) view of the world is that we can only trust inferences from models that are (locally) hard to estimate. Nothing that I know about the practicalities of data analysis offers any evidence that this view is too pessimistic. What we can expect, therefore, is that we ought to find trustworthy and helpful – – not as {\it prescribing} procedures, but as useful challenges to procedures, leading to useful illustrations of how procedures function – – those models that are (locally in the space of models) hard to estimate. Locally easy models are not to be trusted.

From one point of view, that we are putting forward is a two-part description of the meaning of the acronym

tinstaafl

well know to science-fiction readers, which stands for

THERE’S NO SUCH THING AS A FREE LUNCH

the two parts being expressible in similar terms, but more closely relevant to our context as

NO ONE HAS EVER SHOWN THAT HE OR SHE HAD A FREE LUNCH

Here, of course, “FREE LUNCH” means “usefulness of a model that is locally easy to make inferences from”.

Any very specific characteristic of a distribution is a potential handle by which easier estimation is possible. So we want models that are as {\it bland} as we know how to make them as our prime guides – – our prime illumination of the behavior of alternative procedures.

There is a sense in which the most useful interpretation of the Central Limit Theorem is not about how nearly Gaussian the distribution of estimates is likely to be, but rather as an indication of blandness of Gaussian models – – since they arise, in the limit, by the disappearance of any and every kind of distinctive behavior. As one of several, or even many, models useful in illuminating behavior of some procedure, then, Gaussian distributions probably do have a special role.

Other references are Chapter 4 Huber and Ronchetti which just repeats the corresponding chapter of Huber’s 1981 book and Chapter 8.2d of Hampel, Ronchetti, Rousseeuw and Stahel.

You see that Tukey uses the word ‘easy’ is the same sense that I use it meaning that it is possible to get a very accurate estimate of the parameter in question. It does not refer to the difficulty of the algorithm required to calculate the estimate. In this sense it is difficult to estimate the mean of a Gaussian and it is most difficult amongst all other models with the same mean and variance. The Gaussian distribution is a worst case. This is what Tukey means and it is what I mean. Looked at it in this way the mean is a minimax statistic. This has nothing to do with using extra information from outside the data set to improve on the mean of the Gaussian distribution.

In terms of severity as defined by Mayo the Gaussian distribution is the least severe amongst all distributions with the same mean and variance. I do not think that this is in any manner a problem for Mayo’s severe testing. One simply has to recognize that severity can be imported through the model, what Tukey calls a free lunch, and state that as far as possible the severe testing should by based on models which do not offer a free lunch.

You dismiss the comb distribution as pathological. You are not the first to use this adjective in this context. I also use the word pathological as in ‘pathologically discontinuous’ but I then proceed to give a more precise definition of what I mean. Thus in a loose way I could also refer to the comb distribution as pathological but I make this more precise by saying that this refers to its high Fisher information leading to extremely precise estimates of the parameter involved.

On moving from the pathological comb distribution to the Gaussian distribution we meet on the way the Laplace distribution, all standardized to have the same mean and variance. Would you also refer to the Laplace distribution as pathological? I doubt it. My argument would still be valid. The Gaussian distribution has a smaller Fisher information and so the use of the Laplace distribution would lead to an increase in efficiency, for example smaller confidence intervals and smaller p-values. Thus unless you have good reasons for the Laplace distribution you should use the Gaussian distribution. What could such a good reason be? At first glance it would seem that the Laplace distribution offers some protection against possible outliers and in a certain sense it does. On the other hand the high Fisher information is due to the peak at x=mu, in particular the discontinuity of the first derivative. This has nothing to do with outliers and consequently these properties taken together do not give a coherent augment in favour of the Laplace distribution. We may be getting some protection against outliers but this does not in itself justify the high Fisher information. It is more coherent to minimize the Fisher information in a Kolmogorov ball which leads to the Huber distributions. Read page 80 of Huber and Ronchetti This is sort of the right answer but not quite because of the scale problem. Discussing this however would take up too much space. As an aside, Tukey continues the above quotation with a discussion of the Cauchy distribution as a model for outliers. He rejects it in favour of the slash distribution because the Cauchy distribution is highly peaked at the origin allowing for an increase in precision which has nothing to do with outliers, the same argument.

My system, or rather my way of thinking, is not complicated or, if you prefer it, it is no more complicated than the subject matter under discussion. The blog is about the foundations of statistics and I don’t think these should be based on adjectives such as pathological if these are not accompanied by a reasonably precise definition of what pathological means. Things are of course much simpler if you ignore these problems. You write

“Do I need to understand regularization that is not needed for my models …”

but it is needed in most situations. Every time you use the Gaussian distribution you have regularized, you have chosen a minimum Fisher distribution which means you have regularized. Whether you understand what you have done or not is another question.

You write

“Many statisticians have only the most vague notion of the centrality of likelihood to statistically supported inferences, and most would deny that a likelihood function has any direct utility.”

I of course agree with them with some well-defined exceptions.

Any satisfactory account of the foundations of statistics must have a Popperian element. Mayo has made this explicit in her writings and was the first to do so. My approach to statistics as the theory and practice of approximate stochastic models has a Popperian element built into it but this is not made explicit by referring to Popper. My approximation regions can be thought of as defining those models which are consistent with the data, a form of severe testing of the models for consistency with the data.

Thanks Laurie. I think I can see what you have been on about a bit more clearly. It sounds like Tukey’s free lunch is a nice companion idea to Good’s subjectivity swept under the carpet idea.

Is it fair to say that we can avoid the problems that come from the non-existence of a free lunch by leaving a big tip in the form of a choice of a model with a low Fisher information at its peak? Like the Gaussian distribution that we almost always use? (If so, then I have routinely avoided the free lunch without even knowing about it.)

Laurie:

Begs for future deconstruction/unpacking:

“In terms of severity as defined by Mayo the Gaussian distribution is the least severe amongst all distributions with the same mean and variance. I do not think that this is in any manner a problem for Mayo’s severe testing. One simply has to recognize that severity can be imported through the model, what Tukey calls a free lunch, and state that as far as possible the severe testing should by based on models which do not offer a free lunch.

Any satisfactory account of the foundations of statistics must have a Popperian element. Mayo has made this explicit in her writings and was the first to do so. My approach to statistics as the theory and practice of approximate stochastic models has a Popperian element built into it but this is not made explicit by referring to Popper. My approximation regions can be thought of as defining those models which are consistent with the data, a form of severe testing of the models for consistency with the data”.

Oliver, I have glanced at the Cox and Reid paper but I miss the connection. This could turn out to be somewhat technical so I suggest we communicate via email.

I could do this but I was hoping that all the other ingredients would be clear from the context. If you have an example where it is not the case let me know. To push it further if one considers only Kolmogorov’s axioms for a probability space then it is nothing more than a normed measure space. What distinguishes probability theory from measure theory is the concept of conditional expectation and this was indeed introduced by Kolmogorov via Radon-Nikodym. Thus instead of a triple one should use a quadruple where the fourth component indicates conditional expectations. I don’t suppose the idea will catch on.

Hi Laurie, sure I’ll send you an email.

One basic point I am trying to make is that since, as you point out, traditional statistics operates ‘within’ a model then to address also the issue of specific model choice (eg Gaussian vs Comb) given a general model form (Location-Scale) the traditional statistician has to treat the model choice simultaneously as the parameter estimation for a given model. This can be considered as parameter estimation for a location-scale model in the presence of a nuisance parameter representing the specific model choice (Normal, Comb etc).

The point of linking Cox and Reid and my simple sketch is to show that choosing the ‘model choice nuisance parameter’ as orthogonal to the parameter(s) of interest (location and scale parameters) leads to a set up where the influence of the model choice is least on the parameter of interest.

The paper itself as well as a number of discussants point out the connections between fisher information, sensitivity/robustness of analysis and orthogonality. The nicest discussions also make more explicit to the somewhat implicit underlying differential geometry framework.

Here is a question, related to my comments above, that I would like to know whether there is a statistical consensus on:

Are unobserved variables, such as the model choice parameters that I discuss above as nuisance parameters, and/or additional ‘latent’ stucture variables, a) to be included ‘within’ the formal statistical model, b) to be excluded but explicitly acknowledged as being conditioned on ‘externally’, or c) are we to limit statistical analysis to observed variables only?

The following quotes illustrate to me an ambiguity among practitioners.

Pearl states (in his ‘why I am not a Bayesian’ paper):

“The demarcation line between causal and statistical concepts is thus clear and crisp. A statistical concept is any concept that can be defined in terms of a distribution (be it personal or frequency-based) of observed variables, and a causal concept is any concept concerning changes in variables that cannot be defined from the distribution alone”

Bunke (1975) states:

“The need of a careful distinction between functional [me: structural] models and distributional models will be emphasised. If we assume a functional [structural] model under certain assumptions the use of the fiducial inference principle (R.A. Fisher [1]) seems quite justified. There are several critics on fiducial arguments in the literature which are based on an unconscious replacement of distribution models by functional models (e.g. J.W. Turkey [2], A.W.F. Edwards [3]). However if the observations obey a functional model which is the starting point for the inference then this criticism becomes irrelevant…

…In cases where the observations with a functional model lead to a restriction of the set of possible values of the error variable a change of the corresponding distribution is forced. The principle of structural error distribution (e.g. D.A.S. Fraser [4], H. and O. Bunke [5]) is like a non-informative principle to use this information. Together with the fiducial assumption we come then to the structural inference principle.”

Fraser (1966) states:

“…the basis for this opinion would seem to be against the standard statistical model: a variable, a parameter, a probability density function,and nothing more. But in many applications there is more and it should appear as additional structure in the statistical model. Some examples will be considered in this section. Against the augmented model Dempster’s remark is inappropriate.

The simple measurement model in ordinary form involves a variable x, a parameter theta and a probability density function f(x-theta) with f specified.

In analyses based on this, there is emphasis on x as a variable and on theta as fixed. The location group as used in Fraser (1961) provides a simple means of describing the fixed shape of the distribution with differing values for theta.

Consider the measurement model in the alternative form x = theta+e. In a particular instance there is a realized value e^r for the error variable. Probability statements concerning the unobserved e^r can be made in exactly the way a dealer at poker might calculate probabilities after dealing cards but before observing any…

…With x observed and known and theta unknown, the distribution of -e which describes theta position with respect to x provides the distribution of possible values for theta

f(x-theta) dtheta,

the structural distribution for theta based on an observed x…

…For someone committed to the ordinary statistical model, an extra principle is seemingly needed. But its introduction can only be to compensate for an inadequacy in the ordinary model. For the alternative form of the measurement model, an extra principle is not needed: x is known, theta is unknown,the distribution of theta position with respect to x is known, and the origin of the scale of measurement is conventional. ”

Jaynes states:

“What if a part of the data set was actually generated by the phenomenon being studied, but for whatever reason we failed to observe it? This is a major difficulty for orthodox statistics, because then the sampling distributions for our estimators are wrong, and the problem must be reconsidered from the start. But for us it is only a minor detail, easily taken into account. We show next that probability theory as logic tells us uniquely how to deal with true but unobserved data; they must be relevant in the sense that our conclusions must depend on whether they were or were not observed; so they have a mathematical status somewhat like that of a set of nuisance parameters.”

Michael, I am glad that that helped. You write ‘Tukey’s free lunch’. In my own defence I point out that the possibility of a free lunch was shown by me in the paper referred to by Tukey. As far as I know Tukey was not aware of this until he read my paper. Tukey’s own paper is entitled

Issues relevant to an honest account of data-based inference, partially in the light of Laurie Davies’s paper.

Here is part of the first section after the introduction

##

\section{A is for Approximation}

Davies’s emphasis on approximation is well-chosen and surprisingly novel. While these will undoubtedly be a place for much careful work in learning how to describe the concept – – and its applications – – in detail, it is clear that Davies has taken the decisive step by asserting that there must be a formal admission that adequate approximation, of one set of observable (or simulated) values by another set, needs to be treated as practical identity.

##

The formal proof of the possibility of a free lunch exploits the difference in topologies of EDA and formal inference and the pathological discontinuity of the differential operator. This in turn makes the case regularization. One possibility is to minimize the Fisher information. Tukey knew immediately what I meant and translated it into his own much more amenable language without any loss of accuracy.

If you happen to have a very bright student who points out the the Laplace distribution is also consistent with the data but yields shorter confidence intervals you can point out that the Fisher information is larger and the shorter intervals are a free lunch. This is much more convincing than an appeal to pathology which will not work in this case.

In general it will not be possible to solve a regularization problem but it is possible to examine the model to see if there are any ‘horns’ (a Tukey term) which can be exploited to increase efficiency and which represent a free lunch. The discontinuity of the first derivative of the Laplace distribution is such a horn unless you have a very good reason for requiring it. Most standard models do not have horns, other models may have horns which can be justified. So I think you are probably have avoided free lunches even if you were not aware of doing so. In general bland models will have a low likelihood compared with models that offer a free lunch but I have no proof of this. So beware of high likelihood values.

If you would like a pdf version of Tukey’s paper based on a Latex transcript send me an email.

Laurie, thank you. Now I see. Your idea is interesting and novel to me, but I’m not sure where to place it among my inferential concepts.

If a bright student of mine (I wish!) were to come to me and suggest using a Lapalce distribution instead of a Gaussian because it yielded narrower intervals I would have, without knowing about the (_your_) free lunch idea, explained that choosing a model on the basis of the narrowness of the interval (or any other output) is fraught with danger, and that it yields tests with bad frequentist accounting properties. (Anyway, the Laplace distribution would yield _wider_ likelihood intervals for many levels of likelihood ratio as it is thick-tailed compared to the Gaussian.) Presumably your free lunch is a way to theorise and explore the problem.

Now that I feel that I understand the free lunch, I am worried that your disdain for the likelihood function is based on something else. That’s because I do not see how the free lunch problem leads to any difficulty for a likelihood function based on a Gaussian model that is (i) chosen in advance of, or independently of, the dataset being analysed, and (ii) has a low Fisher information at its peak.

Deborah, one way of stating it is that a no test based on a free lunch is severe.

Larui: Severity is always relative to an inference, by the way. So what is the inference corresponding to a free lunch? perhaps an example would be a data-dependent alternative H’ (to a null) that perfectly fits the data, or, more generally, one that that infers H’ even though the method has done little or nothing to rule out ways the inference can be wrong.

Oliver, (i) I sort of thought that this is what you meant but I was intimidated by the thought of moving from a finite set of parameter values in Cox and Reid to some large set of probability models. This is what I meant by technical but I guess you guessed that. Have you some ideas on how to formulate this?

(ii) I had to stop at Pearl because I have never read on of his work but have now started on Causal inference in statistics: An overview. My own thoughts on causality are that it is always speculative. How about Einstein’s theory of Brownian motion being caused by molecular bombardment? Is this even a valid question?

Laurie, i) I only have a couple of ill-developed ideas. All of these seem, however, to just lead back to something like your own approach for the most general case. These thoughts are mainly my attempt to understand the pragmatic Likelihoodist/Bayesian approach such as Gelman’s in BDA3 which is mainly (but not entirely) parametric with clever conditioning and careful ‘within/without’ distinctions, and Edwards’ book on likelihood. Also Fisher and his ‘structural’ descendants like Fraser.

There is some interesting work in the infinite-dimensional Bayesian inverse problems world which I’ve been trying to understand better. A lot of it comes down to ‘discretise then solve as a standard problem’ vs ‘formulate properly in function spaces and only discretise at the last minute’. Most people have traditionally taken the former route but there is a lot of interest in the latter these days. Stuart is always a good read on this. Tarantola’s book was the first place a saw a lot of these ideas formulated.

ii) yes I agree. I tend to view causal assumptions as assertions of exact or approximate invariances of some sort. Logically they almost always go beyond the available data (eg how would you experimentally establish a continuous symmetry) and, simplistically, may be falsified perhaps but not ‘inferred’ or ‘detached’ with any reliability. Some of Pearl’s work I agree with, some I don’t. I don’t like how he introduces asymmetry a priori for example. Herbert Simon has an old paper somewhere on similar ideas to Pearl (ie from a computer science/AI/philosophy point of view) but treats symmetry/asymmetry better imo. I also don’t think Pearl understands Gelman’s view on statistics, possibly because Pearl restricts statistics to observables while Gelman doesn’t.

Deborah, it is inference based on a model that offers a free lunch. Testing based on a model which offers a free lunch cannot be severe. Free lunch is here meant in the sense of Tukey given above. It is not a data dependent alternative. An example would be to use the Laplace model instead of the Gaussian model although both fit the data and there is no scientific reason for one rather than the other. A concrete data set for which this holds is the copper data I have mentioned several times.

Laurie: If all inferences that are logically underdetermined or are based on fallible assumptions yield 0 severity tests, then severity would be entirely irrelevant to learning in the real world, which invariably is based on assumptions (reflected in the inferences being approximate). Rather a costly lunch for anyone interested in how we really do gain the kind of knowledge we do. So I must b missing, and really, being in the midst of travel, perhaps shouldn’t jump in.

Michael, here some remarks on likelihood.

(a) Likelihood reduces the measure of fit between a data set x and a statistical model P_theta to a single number irrespective of the complexity of both.

(b) Likelihood is dimensionless and imparts no information about closeness.

(c) Likelihood is blind. Given the data and the model or models, it is not possible to deduce from the values of the likelihood whether the models are close to the data or hopelessly wrong.

(d) Likelihood does not order models with respect to their fit to the data.

(e) Likelihood based procedures for model choice (AIC, BIC, MDL, Bayes) give no reason for being satisfied or dissatisfied with the models on offer.

(f) Likelihood does not contain all the relevant information in the data x about the values of the parameter theta.

(g) Given the model, the sample cannot be reduced to the sufficient statistics without loss of information.

(h) Likelihood is based on the differential operator and is consequently pathologically discontinuous.

(i) Likelihood is evanescent: a slight perturbation of the model P_theta to a model P_theta^* can cause it to vanish.

On the positive side:

(j) Likelihood delimits the possible.

Laurie, so many of your points seem to be erroneous that I have to suppose that we are, again, talking of different things.

Here are my quick responses to your enumerated points.

(a) Wrong. Given a model, a dataset yields a value of likelihood for each of the possible values of the parameter(s) of interest. Those values make up the likelihood function.

(b) Mistaken on two counts. First, probability if dimensionless, and yet we assume that it has some utility. Second, it is the likelihood function that tells of closeness, not a single value of likelihood.

(c) True, but unimportant in most circumstances. Likelihood functions rank the possible values of the parameters once a model is chosen.

(d) True, but irrelevant. A hammer is pretty useless for microsurgery and a microscope is useless for large-scale carpentry. Use the appropriate tool for the task.

(e) True, but irrelevant. If your task is model choice then choose a different approach. (I would not leave the choice of model to statistics for my own type of purposes. Make a thoughtful choice.)

(f) Define “relevant”. The likelihood function contains all of the information relevant to ranking parameter values within the model.

(g) Really? What does “sufficient statistic” mean in that case?

(h) I still don’t understand what you mean by that statement. Has Tukey explained it?

(i) True, but that is as it should be. Likelihood functions are only useful within a model. Change the model and the function should change.

(j) True, but only _within_ the model. Other models will have potentially other parameters which offer other possibilities.

Michael, no, we are talking about the same object but looked at from very different angles.

(a) Right. In my notation a model is a single probability measure. I reformulate it in your notation. The likelihood reduces the fit between each distribution P_theta and the data x to a single number. So x is an image, P_theta is one particular distribution over the space of images, and the likelihood reduces the relationship between these two very complicated objects to a single number. Stringing all these numbers together to give a function does not in any way alter this fact.

(b)

(c) Density estimation. The parameter theta includes k, the number of Gaussian distributions in the mixture and the mean and variance of each of the k components. sum_1^kN(mu(i),sigma(i)^2). The model is chosen. The best supported values of theta are those with k=n, number of observations, mu(i)=x(i) and sigma(i) small. Density estimation is ill-posed and requires regularization just as the location-scale problem requires regularization. So I minimize the number of peaks subject to constraints given by a Kuiper metric.

(d) In statistics the appropriate approach is to recognize the fact that the problems are ill-posed and require regularization. The form of regularization depends on the problem under consideration; location-scale, density estimation, non-parametric regression, choice of models. Given this one performs the regularization in so far as this is algorithmically (?) possible. No role for likelihood in this approach.

(e) ok

(f) It does not tells us which parameter values are consistent with the data and if they are not consistent why they are not consistent. Thus in the Gaussian location-scale problem a parameter value may not be consistent because it makes one or more of extreme values outliers. This happens by the way with the copper data.

(g) In the sense that the likelihood can be reduced to a function of the sufficient statistics, mean and standard deviation for the Gaussian family. Given the sufficient statistics what is left is independent of the parameter theta. At least this is how I have so far understood the likelihood claim.

(h) Tukey:

##

Davies emphasizes the pathological discontinuity of estimate behavior as a function of assumed model. In any neighborhood of a simple model (location, or location- and-scale, say) there are arbitrarily many (similarly simple) models with arbitrarily precise potentialities for estimation – – potentialities that can only be realized by very different procedures that [sic] those that realize the potentialities of the original simple model.

There are always nearby models that “describe the data” – – and are arbitrarily easier to estimate from. There may or may not be models that “describe the data” and are harder to estimate from.

##

Tukey understood and accepted the argument. The pathological discontinuity of the estimate derives from the pathological discontinuity of the differential operator. We have already had a discussion about the latter.

(i) Change the model and there may be no measure which dominates P_theta^* for all theta simultaneously. If you don’t have this you don’t have any likelihood.

(j) If you minimize Fisher information amongst all distribution with a given mean and variance the result is the Gaussian distribution with this mean and variance. It is worst case. What is the best you can do in worst case? It is often not quite clear but a often a good default is to use maximum likelihood for this distribution, in this case the mean. This will tell you what is best possible under the worst circumstances, minimax. It delimits the possible after regularization. The regularization component is necessary.

My world of statistics consists of individual probability measures, a weak topology over the set of probability measures which defines what close is, a concept of how such objects can be said to approximate a data set x, and a realization that most statistical problems are ill-posed, from location-scale to image analysis, and require regularization. It is from this point of view that I look at likelihood and find it lacking in virtually every aspect.

Laurie, your statistical world is clearly far more complicated than mine. In my world we are able to say things like “Given a statistical model” and take the analysis from there. The decision as to which model to use requires thinking about distributions, assumptions about representativeness of samples, and sometimes issues that relate to convenience of calculation etc., but likelihoods based on the observed data by themselves do not provide the information necessary to choose the model. In my world the likelihood principle applies and it is fair to say that the relative values of a likelihood function supply a ranking of the goodness of fit of all possible values of parameters within the selected model.

In your world the word ‘model’ means something quite different and so the likelihood principle has no meaning, no relevance, or it requires a translation. Fine.

I will take your word for the lack of utility of likelihood in your world, but I do not think that that lack of utility detracts from the utility of likelihood in my defined statistical model world.

Neither of our worlds can properly be called ‘the real world’, but I will claim that my world resembles the statistical worlds of Gossett, Fisher and Neyman, and it is the world in which likelihood is central. It is also the world in which this blog spends most of its time.

Oliver, you have read a lot more on these topics than I have but I find this useful, it gives me hints about what I could/should read. At one point Pearl say that one cannot deduce cause from association (symptoms and illness) and seems to suggest that some people nevertheless do. This is pure Hume to me but he is not given in the list of references. As far as I understand it after a very short perusal Pearl’s main point is that the causal assumptions should be made explicit and he gives a calculus for doing this. Is this correct?

Laurie – yup (based on my understanding)

By the way (if interested) Dash et al in ‘Toward a New Representation for Causation in Dynamic Systems’ (http://www.dsi.unive.it/PhiMaLe2011/Abstract/Dash_Voortman.pdf) nicely summarise the difference between Pearl and Simon’s approaches:

The structural equation approach is essentially the representation advocated by Spirtes et al. [1993] and

Pearl [2000] who also co-identified a set of assumptions under which the structure of causal models may be recovered from observational data. This work has been very influential in machine learning…

…The view of causation proposed by Simon [1952] was slightly different. Instead of pre-solving each variable in terms of its causal parents, a system is described as a set of fundamental mechanisms which corresponded to invariant laws of nature governing the system. These mechanisms by themselves are symmetric, but once they are placed in the context of a fixed set of boundary conditions, they lead to assymmetric causal relations.”

I find this latter approach much more consistent with my own views (and consistent with the basic lessons of e.g. statistical mechanics re: irreversibility, time’s arrow, causality etc)

Michael, missed (b), sorry. First count: yes you are correct, probability is also dimensionless and the statement will need rethinking. Second count: correct as stands. The likelihood function for the Gaussian distribution depends only on the mean and variance of the data. All data sets with the same mean and variance have the same Gaussian likelihood function. Consequently the likelihood function cannot give you any idea of closeness. The Kuiper metric in contrast does.

RE: “The likelihood reduces the fit between each distribution P_theta and the data x to a single number. So x is an image, P_theta is one particular distribution over the space of images, and the likelihood reduces the relationship between these two very complicated objects to a single number. Stringing all these numbers together to give a function does not in any way alter this fact.”

I wonder what you (either Laurie or Michael in fact) would think of the following procedure:

Suppose you have a given image Y0

– Take a stochastic model capable of generating ‘random images’.

– *Define* the meaning of ‘image 1 = image 2’ by ‘given a collection of functionals (from image space to real numbers) then, for each functional F_i, the value |F_i(image1) – F_i(image2)| < eps_i for the user-specified tolerance eps_i (one per functional, say). E.g. 'average pixel intensities for the two images are equal within the given tolerance'.

– For each parameter theta, generate a large collection of images {Y1,Y2,….} based on your model. Say N images per parameter value.

– For each of these parameter values assign a 'score' to the given parameter based on how many of its generated images Y satisfy Y = Y0 according to the above definition of 'image equality'.

– Plot the scores vs the corresponding parameter values.

Is this a sensible procedure? How do you interpret it?

Michael, you remember the Gelman’s mixture model. A perfectly innocuous model but the likelihood is largest for mu_1=x_i for some data point x_i and then letting sigma_1 tend to zero. I gave the example of a mixture of Gaussian distributions where the likelihood is maximized by imitating the data. So no likelihood does not order the parameters in terms of goodness-of-fit. It may do sometimes but there are simply too many exceptions to make this a general principle or a reliable guide. Users of likelihood are caught between wanting likelihood to be high but not too high and have no principles for deciding when high is too high. It seems to me that every time this happens the argument is that such values are clearly inconsistent or too consistent with the data and then some ad hoc argument is invoked to justify ignoring them. Weak metrics as a measure of closeness do not suffer from this problem and they are a natural measure of closeness.

You write

##

The decision as to which model to use requires thinking about distributions, assumptions about representativeness of samples, and sometimes issues that relate to convenience of calculation etc., but likelihoods based on the observed data by themselves do not provide the information necessary to choose the model.

##

I agree. Once I have done this and, if I am using a parametric model, specified the parameter values which are consistent with the data there is no more to be said for this particular model. Moreover any approach to statistics which is not able to give an account of the choice of model leaves much to be desired, in fact it ignores the main problem.

Your use of the word model for a parametric family of models is a reflection of your attachment to likelihood. What do you do if there is no simple parametric model for the data?

Laurie, you wrote: “I gave the example of a mixture of Gaussian distributions where the likelihood is maximized by imitating the data. So[,] no[,] likelihood does not order the parameters in terms of goodness-of-fit. It may do sometimes but there are simply too many exceptions to make this a general principle or a reliable guide.”

The problem here is that the likelihoods can only give a ranking of the goodness of fit of parameter values _within the model_. Thus if you choose a model with bad properties for the purpose of a likelihood-based ranking of parameter values then you can get a bad result. Models with redundancies among the parameters often perform badly, as do models with more parameters than you have data points. So what? Those same models would perform badly for most types of analysis. You have to deal with model selection separately from parameter value evaluation. [Note that I am using “model” in the standard way to mean statistical model that usually specifies a sampling distribution for the test statistic for all values of the model parameter(s). You may habitually use the word differently, but my comments have to be evaluated on the basis of my intended meaning of “model”.]

You go on: “Users of likelihood are caught between wanting likelihood to be high but not too high and have no principles for deciding when high is too high. It seems to me that every time this happens the argument is that such values are clearly inconsistent or too consistent with the data and then some ad hoc argument is invoked to justify ignoring them.”

That is not true. Read my previous paragraph again. [Pause for re-reading.] Now, note that if you have chosen a model and you get a very high likelihood for a particular parameter value, that means only that a parameter value is strongly supported by the data _within that model_.

More: “Moreover any approach to statistics which is not able to give an account of the choice of model leaves much to be desired, in fact it ignores the main problem.”

I see why you might say that, but the reality is that the best tools often have limited scope. A hammer is great for driving nails, but when used to drive a screw it does a poor job. Exactly why does an “approach to statistics” have to use one tool for all purposes? It doesn’t. You seem to be beating on likelihood for a limitation that no likelihoodlum would deny, and a limitation that applies equally well to most single-method “approaches to statistics”, including frequentist, Bayesian and severity-based methods. [Model selection may be “the main problem” for _your_ purposes, but I do not think that it is for most situations where experimental data are analysed statistically to support inferences.]

You finish with a question: “Your use of the word model for a parametric family of models is a reflection of your attachment to likelihood. What do you do if there is no simple parametric model for the data?”

If I notice that there is not simple parametric model for the data I might reach for a non-parametric analysis. Or I might design the experiment differently so that a simple analysis based on a safe Gaussian model is appropriate. The fact that there is no clear route to a likelihood function in cases where a non-parametric model is used does not detract from the utility of the likelihood function in the common cases.

You are making very generalised and universal-sounding criticisms of likelihood on the basis of quite specific issues that have restricted scope. I believe that I have made the points of this comment to you previously. What specifically do I fail to communicate?

Michael – you might like the Wood paper I linked below. I think both you and Laurie would agree with aspects of his analysis (though both of you may not like others…). Perhaps you could agree under the umbrella of ‘Likeliness’ (see my comments below) instead of ‘Likelihood’.

I think y’all should be interested in the latest goods being sold by Berger et al.. It’s not just the problem with using power/alpha as a pre-data rejection ratio:

https://errorstatistics.com/2016/04/11/when-the-rejection-ratio-1-%CE%B2%CE%B1-turns-evidence-on-its-head-for-those-practicing-in-the-error-statistical-tribe/

but the magic (p. 11) by which optional stopping (with a proper stopping rule) (a) is so damaging as to make collecting data pointless for a frequentist (they say, since a rejection is guaranteed), (b) but does no damage to the reliability of the corresponding Bayes factor!

https://errorstatistics.files.wordpress.com/2016/04/2016-published-version-of-bayarri-benjamini-berger-sellke-rejection-odds.pdf

It’s true, as they point out, that the difference in the coefficients cancels out in the ratio, but the error probability control goes out the window!

I would like to point to the fact that a procedure guaranteed to reject the null has a very good power, and so its false negative error control is attractive. (Tongue only slightly in cheek here: this is an example of where “error probability control” refers exclusively to false positive errors.)

Seriously, the evidence in the data in relation to the model parameter values (within the model) are unaffected by the stopping rules. Get over it. If you want to deal with error control then you are want to do something other than looking at the evidence in the data relative to the values of the parameter(s) of interest in the model.

Oliver, yes, that is in principle how I see it. If you are lucky you may avoid simulations in some cases by using an asymptotic approximation valid for the n you have. In other words you use different measures of fit which reflect different properties of the image (data) you are analysing.

Thanks Laurie. Given my phrasing above I think it would be possible to interpret this as a suitably generalised likelihood analysis.

In particular, since you plot the percentage of accepted values and not just whether this reached a threshold this seems to make it more akin to a likelihood analysis.

Since we’ve redefined ‘observed data = simulated data’ according to a notion of ‘look like’, however, it’s not quite standard likelihood.

Maybe it could be called a ‘Likeliness function/region’, emphasising the ‘look like’ part. Wood I believe calls it a ‘synthetic likelihood’ approach (http://www.nature.com/nature/journal/v466/n7310/abs/nature09319.html, pdf – http://faculty.bscb.cornell.edu/~hooker/NonlinearDynamics/Papers/Wood_2010.pdf)

(Note that in the paper I linked Wood uses a Normal model for the summary statistics but this is not necessary in general, at least according to my conception of a ‘likeliness’ algorithm. ABC is a very closely related procedure too, as I’ve mentioned before. As you’ve noted, this approach may make sense and work well but does deviate from the ‘standard story’ of likelihood/bayes somewhat).

Also, here is the key part of Wood (imo):

“The problem with the conventional approaches is actually philosophical. Naive methods of statistical inference try to make the model reproduce the exact course of the observed data in a way that the real system itself would not do if repeated.

Although the dynamic processes driving the system are a repeatable feature about which inferences might be made, the local phase of the data is an entirely noise-driven feature, which should not contribute to any measure of match or mismatch between model and data. Hence, if statistical methods are to correspond with what is scientifically meaningful, it is necessary to judge model fit using statistics that reflect what is dynamically important in the data, and to discard the details of local phase.

In itself, this idea is not new. What is new is the ability to assess the consistency of statistics of the model simulations and data without recourse to ad hoc measures of that consistency, but in a way that instead gives access to much of the machinery of likelihood-based statistical inference”

Interesting paper, for sure. However, the chaotic time-varying systems about which Wood makes the statements that you quote are quite different in nature to the simple systems that I am usually dealing with. The population and sample paradigm is quite close to the underlying realities of many conventional pharmacological experiments. We should not overload the analyses of simple systems with ideas that are necessary to more complicated setups.

I mean, it’s just some experiments on blowflys. It seems a bit arbitrary to me to say pharmacology is in anyway simpler except at this particular time with these particular setups. I don’t doubt that likelihood analysis is useful for your work. I just don’t see what the point of promoting something like Likehoodism as a general statistical philosophy (Edwards style or whatever) is if it wouldn’t survive simple generalisation to eg a model of a single ion channel let alone thinking about more complex pathways of drug action.

Plus in many ways an approach like Woods’ or the sketch I gave is a) likelihood-like and b) in many ways simpler.

It’s a simple algorithm – in fact I just gave a simplified version to some second year engineering students as part of an assignment for a probability course and some coded it up in R and sent me results the same day.

The version I gave them has a simple interpretation – for each input parameter what percentage of the generated data ‘looked like’ the given data. They plot this ‘likeness curve’ (really just a form of likelihood function) and interpret it. Easy and fairly unobjectionable no? It also fits naturally with the follow on modules on exploratory data analysis and markov processes (eg they could fit rate parameters for models of drug action on markov models of channel gating). Closer to ‘one mode of analysis’ as Laurie says.

Michael, you write

##

You have to deal with model selection separately from parameter value evaluation. [Note that I am using “model” in the standard way to mean statistical model that usually specifies a sampling distribution for the test statistic for all values of the model parameter(s).

##

Ok, so let us take the copper data and now tell me the reasoning behind your decision to use the Gaussian model in your sense. The only sense I can make of selecting a parametric model in your sense is that there exist parameter values theta such that the very specific model P_theta in my sense is consistent with the data. Thus the N(2.00,0.1^2) is consistent with the copper data where I make very precise by what I mean by consistent so that it can be checked. We can also discuss my definition of consistent . On the other hand the N(100,0.1^2) model in my sense is not consistent with the data. You seem to be claiming that you can choose the parametric model without reference to any values of the parameter and their relationship to the data. I don’t think this is a tenable position so you must mean something else.

Remind me of what the famous “copper data” are, and the objectives of the study.

Laurie, I have re-read my comment and I do not see any justification for your assertion: “You seem to be claiming that you can choose the parametric model without reference to any values of the parameter and their relationship to the data.” You seem to be verballing me, and I feel disinclined to continue treating your comments seriously.

Oliver, I have had a look at the paper and find it interesting. I hope to find time in the next few days to give a method of analysing data based on the Ricker model. Here is a link to a paper on modelling the daily returns of the S+P 500. The data are quite complicated and as there is no standard dynamic model we had to develop one of our own.

https://www.statistik.tu-dortmund.de/fileadmin/user_upload/DP_4815_SFB823_Davies_Kraemer.pdf

Great, I look forward to it! And thanks for the link.

I’m trying to collect some examples at the moment for teaching undergrad engineers and mathematicians EDA/simulation-based data analysis and inference, so any examples you have are great.

Michael, I am not verballing you at all although I am not sure what that means. My comments are meant seriously, I am not playing. Once again

#

You have to deal with model selection separately from parameter value evaluation.

#

I do not understand this. I gave an interpretation with which you did not agree. I suggested taking a concrete example the copper data, measurements of the quantity of copper in a sample of drinking water milligrams per litre, all done under the same conditions using the same procedure, and you now detail your reasons for accepting or not accepting the Gaussian model in your sense for this particular data set

2.16 2.21 2.15 2.05 2.06 2.04 1.90 2.03 2.06 2.02 2.06 1.92 2.08 2.05 1.88 1.99 2.01 1.86 1.70 1.88 1.99 1.93 2.20 2.02 1.92 2.13 2.13

Laurie, in order to know what type of model is appropriate I would need to know the nature of the inference(s) that might be made and, often, more about how the values were obtained. For most purposes you need no model beyond what ever the model is behind a statement that the level of copper appears to be around 2 (units not specified). What would be wrong with using the mean or median and the standard deviation of those numbers?

Verballing: Google seems to recognise it.

verb BRITISH informal

gerund or present participle: verballing

attribute a damaging statement to (a suspect), especially dishonestly.

Michael, if you don’t like my data set take one of your own. You mention

blood pressures of rabbits, the diameters of mesenteric arteries, the changes in diameter of arterioles upon pharmacological challenge, the logarithm of concentrations of numerous blood-borne mediators, the logarithm of the EC50s of agonist concentration-response curves,

If the data set is sufficiently small you can give it but what I am really interested in is the arguments you put forward, the calculations you use.

Laurie, if I am looking to see how much a treatment alters blood pressure then I would usually be happy to assume that a treated group will have a different mean to the control group. The mean is then the parameter of interest. I know that blood pressures of a healthy set of rabbits are fairly symmetrically distributed and I know that the normal distribution is good enough for most purposes (and cannot be discounted as being an accurate account of the ‘true’ distribution with small samples). Thus I would reach for the Gaussian. There is nothing strange or unusual about my calculations. Unless I were prepared to assume a known variance (I usually am not), my parameter of interest is not separable from the nuisance parameter, sigma, so I would apply a Student’s t-test. Likelihood function comes from the non-central distribution of Student’s t (as does the severity curve).

What is your question?

Perhaps the remarks of Borel and Neyman in the play in my current blogpost is relevant to some of your skepticism. It’s in honor of Neyman’s birthday today.

https://errorstatistics.com/2016/04/16/jerzy-neyman-and-les-miserables-citations-statistical-theater-in-honor-of-his-birthday/

Michael, you have more or less answered it and the answer was along the lines I was expecting.

#

I know that blood pressures of a healthy set of rabbits are fairly symmetrically distributed and I know that the normal distribution is good enough for most purposes (and cannot be discounted as being an accurate account of the ‘true’ distribution with small samples). Thus I would reach for the Gaussian

#

As you state, I have no doubt that it is good enough for most purposes. However there is no formal procedure for deciding that the data are consistent with the model and as such this cannot be the basis of an account of best statistical practice. You have written several times that you always work within the model and at this stage the model is never questioned. You may give confidence intervals for mu and sigma, calculate p-values and interpret the likelihood function but the model itself is not questioned. My approach is different. Suppose the family of models under consideration is the Gaussian family. Then for each (mu,sigma) I test whether the model N(mu,sigma^2) is consistent with the data. The approximation region is the set of all such (mu,sigma). Once I have done this likelihood has nothing to add. If there is a conflict then between the approximation region and the parameter values supported by likelihood then the approximation region wins. There is such a conflict for Gelman’s normal mixture model where the parameter values with very high likelihood are simply not consistent with the data. The goal is one mode of analysis where the acceptance of a model is also subject to severe testing .

Laurie, you say “Then for each (mu,sigma) I test whether the model N(mu,sigma^2) is consistent with the data. The approximation region is the set of all such (mu,sigma). Once I have done this likelihood has nothing to add.”

How is that different from generating the likelihood function on mu and sigma? It sounds like likelihood would have nothing to add to your procedure because your procedure IS likelihood. Perhaps a difference appears when you are considering multiple families of models.

You should note that the decision regarding the distribution is not made on the basis of the particular dataset in question, but on prior experience along with an understanding of the system under investigation. I routinely warn my students that if their only information about the relevant distribution is the dataset that they have in hand then they have to assume that they do not know enough to specify a distribution (i.e. model). It is best practice to have many experiments under the belt before designing and executing the experiment of record. Exploratory and planned experiments should be dealt with differently. You can see my commentary on the ASA P-values statement for a lot more on this topic: http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108

Michael, the basic idea is that a model P, a single probability measure, for example N(2,0.1^2), is an adequate approximation to a data set x if ‘typical’ data X(P) generated under P ‘look like’ x. In any particular case ‘typical’ is quantified by a number alpha, 0 < alpha < 1 so that 100*alpha% of the samples are typical and 'looks like' is also made precise but this depends on the problem at hand. For the family of Gaussian measures the parameters (mu,sigma) are checked by asking whether the mean, the standard deviation, the degree of outlyingness of the extreme observations and the Kuiper distance to the N(mu,sigma^2) distribution all have values which 'look like' the corresponding values generated under the N(mu,sigma^2) distribution. More precisely the values for the data x must lie between upper and lower quantiles for the corresponding statistics for data generated and N(mu,sigma^2). Exactly which quantiles depends on how the user wishes to weight the four statistics but they should be such that N(mu,sigma^2) is an adequate model for data generated under N(mu,sigma^2) with probability alpha. Here is the R program which does this

fpval<-function(x,alpha=0.9,ngrid=20){

n<-length(x)

x<-sort(x)

beta<-(3+alpha)/4

# beta<-(1+alpha)/2

# if(mod=="g"){

q1<-qnorm((1+beta)/2)

q21<-qchisq((1-beta)/2,n)

q22<-qchisq((1+beta)/2,n)

q3<-qnorm((1+beta^(1/n))/2)

q4<-qkuip(beta)

sigu<-sqrt((n-1)*sd(x)^2/(q21-q1^2/n))

sigl<-sqrt((n-1)*sd(x)^2/q22)

# print(c(sigl,sigu))

# print(q3)

# q3<-4

muu<-mean(x)+q1*sigu/sqrt(n)

mul<-mean(x)-q1*sigu/sqrt(n)

# print(c(mul,muu))

# }

theta<-double(7*(ngrid+1)^2)

dim(theta)<-c((ngrid+1)^2,7)

ic<-0

for(j in 0:ngrid){

sig<-sigl*exp(j*log(sigu/sigl)/ngrid)

for(i in (-ngrid/2):(ngrid/2)){

ic1<-0

ic2<-0

ic3<-0

ic4<-0

mu<-mean(x)+2*i*q1*sig/(ngrid*sqrt(n))

y<-(x-mu)/sig

pmn1q1+0.0001){ic1<-1}

# if(ic1==0){

sy2<-sum(y^2)

pmn2q22)|(sy2<q21)){ic2q3){ic3<-1}

pmn3<-1-(2*pnorm(max(abs(y)))-1)^n

# }

# if(ic1+ic2+ic3==0){

dk<-max((1:n)/n-pnorm(y))-min((1:n)/n-pnorm(y))

dk<-max(dk,max((2:n)/n-pnorm(y[1:(n-1)]))-min((2:n)/n-pnorm(y[1:(n-1)])))

pmn4q4){ic4<-1}

# }

if(ic1+ic2+ic3+ic4==0){

# if(ic1+ic2+ic3==0){

ic<-ic+1

theta[ic,1]<-mu

theta[ic,2]<-sig

theta[ic,3]<-pmn1

theta[ic,4]<-pmn2

theta[ic,5]<-pmn3

theta[ic,6]<-pmn4

theta[ic,7]0){

theta<-theta[1:ic,]

# plot(theta[,1:2])

# print(c(min(theta[,1]),max(theta[,1]),max(theta[,3])))

}

else{theta<-0

dim(theta)<-c(1)

}

list(theta)

}

You also need for the Kuiper distance

#

pkuip<-function(x){

if(x<=0){ss<-0}

else{

ss<-2*(4*x^2-1)*exp(-2*x^2)

eps<-1

m10^(-4)){

m<-m+1

y<-m*x

eps<-2*(4*y^2-1)*exp(-2*y^2)

ss<-ss+eps

}

ss<-1-ss

}

return(ss)

}

#

qkuip<-function(alpha){

x2<-1

ss<-pkuip(x2)

while(ss<alpha){

x2<-2*x2

ss<-pkuip(x2)

}

x1<-0

x3<-(x1+x2)/2

eps10^(-3)){

q3<-pkuip(x3)

eps<-x2-x1

if(q3<alpha){

x1<-x3}

else{x2<-x3}

x3 tmp plot(tmp[[1]][,1],tmp[[1]][,2])

> print(tmp[[1]][1,],dig=3)

[1] 1.9887 0.0901 0.1167 0.0273 0.0358 0.5391 0.0273

>The first two values are mu and sigma, the next four are the p-values for the mean, the standard deviation, the degree of outlyingness and the Kuiper distance. The last gives the minimum of the four p-values. From this you can read off that the mean value is reasonable, the value of sigma is just acceptable, one of the two extreme values of the data is just under the level of being declared an outlier and the Kuiper distance is acceptable. In all cases the cut-off p-value is 0.025.

I leave it to you as to whether you would call this likelihood based or not. It does not conform to my understanding of the term likelihood. If you were to do a likelihood analysis as I understand it and if one set of parameters theta_1 were better supported than another set theta_2 in terms of likelihood but theta_1 was not in my approximation region whereas theta_2 was then I would claim priority unless you had good arguments against the definition of approximation I was using. Gelman’s mixture model is an example that this can happen.

I read your comments on the ASA P-values statement. The choice of model family clearly depends on what you know about the data under consideration. I have analysed data sets where the physicist stated that high counts are Poisson but low ones not because of interference. Sometimes as with the copper data you simply cannot derive any model from the manner in which the data were generated, it is much too complicated and in this situation any model is ad hoc and regularization is a necessary. For me a p-value is a measure of how bad the model determined by the hypothesis approximates the data, that is, is not consistent with the chosen definition of adequacy. Why this is so is a question which can only be answered if at all by knowledge of the data involved which is why I would never use the word significant or effect or claim that otherwise something improbable has happened. Simply that data re not consistent with the model specified by the hypothesis. In the case of the copper data if hypothesis that the legal limit is exceed has a very small p-value then the explanations could be that the limit was not exceed or that the instrument used to make the readings was badly calibrated.

Laurie: Since I’ve missed some of this back and forth, I wasn’t going to jump in, except that I noticed what you wrote in your comment, and it’s in good sync with my own view:

” the basic idea is that a model P, a single probability measure, for example N(2,0.1^2), is an adequate approximation to a data set x if ‘typical’ data X(P) generated under P ‘look like’ x. In any particular case ‘typical’ is quantified by a number alpha, 0 < alpha < 1 so that 100*alpha% of the samples are typical and 'looks like' is also made precise but this depends on the problem at hand." We'd add other requirements such as, if the model M were inadequate (in a specified manner and degree), we'd not very frequently get data deemed typical of M. Or some such stipulation…

Deborah, you are quite right. To take a simple example. The model is N(mu,1) and alpha=0.95. Typically (alpha=0.95) the mean of an i.i.d. sample of size n under N(mu,1) lies between mu-1.96/sqrt(n) and mu+1.96/sqrt(n). If your definition of approximation is based on typical values of the mean only then after some elementary manipulations the approximation region for mu will be [mean(x)-1.96/sqrt(n),mean(x)+1.96/sqrt(n)] which agrees with the standard confidence interval. Thus if x consists entirely of zeros the approximation region would be [-1.96/sqrt(n),1.96/sqrt(n)] which is nonsense. For the family of normal models I check not only whether the mean of the data look like the means of samples under the model but also the variance, the degree of outlyingness of the extreme observations and the distance of the empirical distribution from that of the model distribution. This is what you would call severe testing. When I wrote my book I was not aware of your work. If I had been I would have stated that the approximation region should be based on severe testing in the sense of Mayo. The omission will be rectified in any future work I manage to have published. Thus for a data set consisting entirely of zeros and the model N(mu,1) the approximation region would be empty if the approximation region was based on the mean, standard deviation, outlyingness and distance from the model and not just the mean. This is one reason why it is not a confidence region.

Laurie: Thank you! It’s so rare to get agreement or feel one’s made progress in this arena. All the more reason to check that I really understand your last remark. Rather than guess, can I ask you to complete your sentence (alluding to severe testing, if that is your point):

“Thus for a data set consisting entirely of zeros and the model N(mu,1) the approximation region would be empty if the approximation region was based on the mean, standard deviation, outlyingness and distance from the model” because ________________________

Michael, Deborah, Oliver: In general it is not easy to define an approximation region. Oliver (above) gave a reference to a paper by Wood on the Ricker model for population growth. The population N_t at time t satisfies N_(t+1)=r N_t exp(-N_t-sigma Z_t) where Z_t is Gaussian white noise. Observed is y_t where y_t is a Poisson variable with mean phi N_t. The parameters are (r,sigma,phi). A few simulations show that such data is vary variable for the parameters chosen by Wood (r,sigma,phi)=(49,0.3,10). It is not easy to characterize such data sets. The dynamics of the system have to play a role. The process N_t is a Markov process so I would look at pairs of observations. As a first attempt I base the approximation region on the number of times a y(t)>0 is followed by y_(t+1)=0, the number of times y_t>0 is followed by y_(t+1)>0 the largest observation and the median value of the local maxima. For given (r,sig,phi) the distributions of these four statistics can be determined by simulation. This is done over a grid of values of the parameters and having specified alpha an approximation region can be determined. For any parameter constellation the degree of approximation can be measure by the minimum of the four p-values. For the data given by Wood the best approximation is for the parameters (51.4,0.20,10.2) with p-values (0.260,0.438,0.296,0.322) respectively for the four statistics. The cut-off point for the p-values with alpha=0.9 is 0.0125. The point is that I can use the same approach for approximating the copper data and the Richter data.

Given a hypothesis Ho which specifies one or a set of models one way of defining the p-values is as follows. It is 1-alpha^* where alpha^* is the smallest value of alpha such that the approximation region based on alpha has a non-empty intersection with the models specified by Ho. In this sense a p-value is a measure of approximation. There are other possible ways of defining a p-value as a measure of approximation and other ways of measuring approximation which are not based on p-values.

Laurie: What matters is, not being able to explicitly define it, saying something that captures the intended goal or rationale. That is why I’m hoping you will try to complete the sentence using severity, even recognizing that its implementation is complex & thorny.

Deborah, assuming the sample size is at least 2 the simplest answer is

because the Kuiper distance of 1 between the empirical and the model distributions is not typical under the model for any model in the family.

To expand, you have to base the decision as to whether to accept the model or not on the values of a small number of statistics. The statistics you choose will depend on the problem at hand and on the sort of deviation you wish to guard against. Thus in the copper example you may wish to estimate the true amount of copper by the mean because this has proved to be a good choice (it hasn’t but we will pretend it has). Interlaboratory tests are a form of quality control. The variance of the sample gives some measure of the precision of the measurement and the care with which the measurements were made. So we include the mean and the variance. Outliers are common so we wish to prevent them influencing the results. Finally if we are to base the analysis on the family of Gaussian model we check whether the shape of the empirical distribution function is close to that of the model.

We have here four statistics and an amount 1-alpha probability to play with. If we treat each of these statistics equally we can spend (1-alpha)/4 on each. The outlyingness and the Kuiper distance are both one-sided, we are only concerned as to whether they are too large. These give one-sided limits. For the mean and the standard deviation we take two-sided limits as we will reject the model if for example the variance of the data is too small or if it is too large. For the mean we have (1-alpha)/4 to play and treating too large a mean and to small a mean in the same manner we have (1-alpha)/8 to spend on each. Thus the limits for the variance are

qchisq((1-alpha)/8,n) <=sum_i^n(x_i-mu)^2/sigma^2 <= qchisq(1-(1-alpha)/8,n).

To come back to the Kuiper distance. the data are all equal to zero and hence all Lie in the interval [0,0]. The empirical measure of this interval is 1. However for any normal distribution the measure of the interval [0,0] is zero. Hence the distance is 1. If you have data with at least two distinct values the distance is at most 1/2.

The approximation region is constructed in such a manner that if the data really are N(mu,sigma^2) for some (mu,sigma) then the pair (mu,sigma) will belong to the approximation region with probability (at least) alpha. If I understand you correctly then severe testing means that if the data were generated under some distribution P not in the class of models then the approximation region should be empty. One can do this by simulation. How large must the sample be before data generated under the Cauchy model can be reliably declared as non-Gaussian, or the t_4 model, or the log-normal model, or the exponential model, or the binom(1,1/2) model, or how large must the outlier be before it can be reliably detected etc? Thus severe testing must be a guideline when constructing approximation regions.