**Methodology and Ontology in Statistical Modeling: Some Error Statistical Reflections** (Spanos and Mayo), uncorrected

Our presentation falls under the second of the bulleted questions for the conference (conference blog is here):

**How do methods of data generation, statistical modeling, and inference influence the construction and appraisal of theories?**

Statistical methodology can influence what we think we’re finding out about the world, in the most problematic ways, traced to such facts as:

- All statistical models are false
- Statistical significance is not substantive significance
- Statistical association is not causation
- No evidence against a statistical null hypothesis is not evidence the null is true
- If you torture the data enough they will confess.

(or just omit unfavorable data)

These points are ancient (“lying with statistics”; “lies, damned lies, and statistics”)

People are discussing these problems more than ever (big data), but it’s rarely realized how much certain methodologies are at the root of the current problems.

__________________1__________________

**All Statistical Models are False**

Take the popular slogan in statistics and elsewhere: “all statistical models are false!”

What the “all models are false” charge boils down to:

(1) the statistical model of the data is at most an idealized and partial representation of the actual data generating source.

(2) a statistical inference is at most an idealized and partial answer to a substantive theory or question.

- But we already know our models are idealizations: that’s what makes them models
- Reasserting these facts is not informative.
- Yet they are taken to have various (dire) implications about the nature and limits of statistical methodology
- Neither of these facts precludes using statistical models to find out true things
- On the contrary, it would be impossible to learn about the world if we did not deliberately falsify and simplify.
__________________2__________________

- Notably, the “all models are false” slogan is followed up by “But some are useful.”

- Their usefulness, we claim, is being capable of adequately capturing an aspect of a phenomenon of interest

- Then a hypothesis asserting its adequacy (or inadequacy) is capable of being true!

Note: All methods of statistical inference rest on statistical models.

What differentiates accounts is how well they step up to the plate in checking adequacy and in learning despite violations of statistical assumptions (robustness)

__________________3__________________

**Statistical significance is not substantive significance**

Statistical models (as they arise in the methodology of statistical inference) live somewhere between

- Substantive questions, hypotheses, theories H

- Statistical models of phenomena, experiments, data: M

- Data x

What statistical inference has to do is afford adequate link-ups (reporting precision, accuracy, reliability)

__________________4__________________

Recent Higgs reports give evidence of a real (Higgs-like) effect (July 2012, March 2013)

Researchers define a “global signal strength” parameter

H_{0}: μ = 0 corresponds to the background (null hypothesis),

μ > 0 to background + Standard Model Higgs boson signal,

but μ is only indirectly related to parameters in substantive models

As is typical of so much of actual inference (experimental and non), testable predictions are statistical:

They deduced what would be expected statistically from background alone (compared to the 5 sigma observed)

in particular, alluding to an overall test S:

Pr(Test S would yield d(**X**) > 5 standard deviations; H_{0}) ≤ .0000003.

This is an example of an *error probability.*
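The reported bound can be checked with a one-line tail-area computation. This is only a sketch, not the actual CERN analysis; it assumes the test statistic is approximately standard normal under H_{0}:

```python
# Sketch: the one-sided upper tail of a standard normal beyond 5 sigma,
# i.e. Pr(d(X) > 5 standard deviations; H0) under a normal approximation.
from math import erfc, sqrt

def normal_upper_tail(z):
    """Pr(Z > z) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

p_five_sigma = normal_upper_tail(5)
print(f"{p_five_sigma:.7f}")  # 0.0000003, matching the reported bound
```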

__________________5__________________

**The move from statistical report to evidence **

The inference actually **detached** from the evidence can be put in any number of ways

*There is strong evidence for H: a Higgs (or a Higgs-like) particle.*

An implicit principle of inference is

Why do data **x**_{0} from a test S provide evidence for rejecting H_{0}?

Because were H_{0} a reasonably adequate description of the process generating the data, it would (very probably) have survived the test (with respect to the question).

Yet statistically significant departures are generated: July 2012, March 2013 (from 5 to 7 sigma)

Inferring the observed difference is “real” (non-fluke) has been put to a severe test

Philosophers often call it an “argument from coincidence”

(This is a highly stringent level; apparently in this arena of particle physics, smaller observed effects often disappear)

__________________6__________________

Even so we cannot infer to any full theory

That’s what’s wrong with the slogan “Inference to the ‘Best’ Explanation”:

Some explanatory hypothesis T entails statistically significant effect.

Statistical effect x is observed.

Data x are good evidence for T.

The problem: Pr(T “fits” data x; T is false) = high

And in other less theoretical fields, the perils of “theory-laden” interpretation of even genuine statistical effects are great

[Babies look statistically significantly longer when red balls are picked from a basket with few red balls:

Does this show they are running, at some intuitive level, a statistical significance test, recognizing statistically surprising results? It’s not clear]

__________________7__________________

The general worry reflects an implicit requirement for evidence:

*Minimal Requirement for Evidence.* If data are in accordance with a theory *T*, but the method would have issued so good a fit even if *T* is false, then the data provide poor or no evidence for *T*.

The basic principle isn’t new; we find it in Peirce, Popper, Glymour…. What’s new is finding a way to use error probabilities from frequentist statistics (error statistics) to cash it out

To resolve controversies in statistics and even give a foundation for rival accounts

__________________8__________________

**Dirty Hands**: But these statistical assessments, some object, depend on methodological choices in specifying statistical methods; outputs are influenced by discretionary judgments: the *dirty hands argument*

While it is obvious that human judgments and human measurements are involved, this (like “all models are false”) is too trivial an observation to distinguish how different accounts handle threats of bias and unwarranted inferences


Regardless of the values behind choices in collecting, modeling, drawing inferences from data, I can critically evaluate how good a job has been done.

(test too sensitive, not sensitive enough, violated assumptions)

__________________9__________________

An even more extreme argument moves from “models are false,” to “models are objects of belief,” to “therefore statistical inference is all about subjective probability.”

By the time we get to the “confirmatory stage” we’ve made so many judgments, why fuss over a few subjective beliefs at the last part….

**George Box** (a well known statistician): “the confirmatory stage of an investigation…will typically occupy, perhaps, only the last 5 per cent of the experimental effort. The other 95 per cent—the wondering journey that has finally led to that destination—involves many heroic subjective choices (what variables? What levels? What scales? etc., etc.)…. Since there is no way to avoid these subjective choices…why should we fuss over subjective probability?” (70)

It is one thing to say our models are objects of belief, and quite another to convert the entire task to modeling beliefs.

We may call this shift from *phenomena to epiphenomena* (Glymour 2010)


Yes, there are assumptions, but we can test them, or at least discern how they may render our inferences less precise, or completely wrong.

__________________10__________________

The choice isn’t full blown truth or degrees of belief.

We may warrant models (and inferences) to various degrees, such as by assessing how well corroborated they are.

Some try to adopt this perspective of testing their statistical models, but give us tools with very little power to find violations

- Some of these same people, ironically, say since we know our model is false, a criterion such as high power to detect falsity is not of great interest. (Gelman).

- Knowing something is an approximation is not to pinpoint where it is false, or how to get a better model.

[Unless you have methods with power to probe this approximation, you will have learned nothing about where the model stands up and where it breaks down, what flaws you can rule out, and which you cannot.]

__________________11__________________

**Back to our question**

How do methods of data generation, statistical modeling, and analysis influence the construction and appraisal of theories at multiple levels?

- All statistical models are false
- Statistical significance is not substantive significance
- Statistical association is not causation
- No evidence against a statistical null hypothesis is not evidence the null is true
- If you torture the data enough they will confess.

(or just omit unfavorable data)

These facts open the door to a variety of antiquated statistical fallacies, and the “all models are false,” “dirty hands,” and “it’s all subjective” arguments encourage them.

From popularized to sophisticated research, in social sciences, medicine, social psychology

*“We’re more fooled by noise than ever before, and it’s because of a nasty phenomenon called ‘big data’. With big data, researchers have brought cherry-picking to an industrial level.”* (**Taleb**, *Fooled by Randomness*, 2013)

It’s not big data; it’s big mistakes about methodology and modeling.

__________________12__________________

*This business of cherry picking falls under a more general issue of “selection effects” that I have been studying and writing about for many years.*

*Selection effects* come in various forms and are given different names: double counting, hunting with a shotgun (for statistical significance), looking for the pony, look-elsewhere effects, data dredging, multiple testing, p-value hacking

One common example: a published result of a clinical trial alleges a statistically significant benefit (of a given drug for a given disease) at a small level, .01, but ignores 19 other non-significant trials; searching across many factors makes it easy to find a positive result on one factor or another, even if all are spurious.

The probability that the procedure yields erroneous rejections differs from, and will be much greater than, 0.01

(nominal vs actual significance levels)

How to adjust for hunting and multiple testing is a separate issue (e.g., false discovery rates).
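The gap between nominal and actual levels is simple arithmetic. A sketch using the hypothetical numbers above (20 independent trials, each tested at level .01, all nulls true):

```python
# Hypothetical numbers from the text: 20 independent trials, each tested
# at nominal level 0.01, with every null actually true.
nominal_level = 0.01
n_trials = 20

# Actual probability the procedure yields at least one erroneous rejection
actual_level = 1 - (1 - nominal_level) ** n_trials
print(round(actual_level, 3))  # 0.182 -- far above the nominal 0.01

# One conservative adjustment (Bonferroni): test each trial at nominal/n
bonferroni_level = nominal_level / n_trials
print(bonferroni_level)  # 0.0005
```

The reported “.01” describes a single pre-specified test, not the hunting procedure actually used; the procedure’s error probability is roughly eighteen times larger.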

__________________13__________________

If one reports results selectively, or stops when the data look good, etc., it becomes easy to prejudge hypotheses:

Your favored hypothesis *H* might be said to have “passed” the test, but it is a test that lacks stringency or severity.

(our minimal principle for evidence again)

- Selection effects alter the error probabilities of tests and estimation methods, so at least methods that compute them can pick up on the influences.

- If, on the other hand, the results are reported in the same way, significance testing’s basic principles are being twisted, distorted, invalidly used.

- It is not a problem about long runs either.

We cannot say of the case at hand that it has done a good job of avoiding the source of misinterpretation, since such a procedure makes it so easy to find a fit even if false.

__________________14__________________

The growth of fallacious statistics is due to the acceptability of methods that declare themselves free from such error-probabilistic encumbrances (e.g., Bayesian accounts).

Popular methods of model selection (AIC, and others) suffer from similar blind spots

Whole new fields for discerning spurious statistics, non-replicable results; statistical forensics: all use error statistical methods to identify flaws

(Stan Young, Uri Simonsohn, Brad Efron, Baggerly and Coombes)

- All statistical models are false
- Statistical significance is not substantive significance
- Statistical association is not causation
- No evidence against a statistical null hypothesis is not evidence the null is true
- If you torture the data enough they will confess.

(or just omit unfavorable data)

To us, the list is not a list of embarrassments but justifications for the account we favor.

__________________15__________________

Models are false

Does not prevent finding out true things with them

Discretionary choices in modeling

Do not entail we are only really learning about beliefs

Do not prevent critically evaluating the properties of the tools you chose.

A methodology that uses probability to assess and control error probabilities has the basis for pinpointing the fallacies (statistical forensics, meta statistical analytics)

These models work because they need only capture rather coarse properties of the phenomena being probed: the error probabilities assessed are approximately related to actual ones.

Problems are intertwined with testing assumptions of statistical models

The person I’ve learned the most about this from is Aris Spanos, who will now turn to that.

__________________16__________________

Here I am delayed at LGA airport, so digging into whatever’s “on draft” (in the blog, I mean, not the bar), I see my half of the Mayo/Spanos “dog and pony” show– in the form of my slides– from the Onto-Meth conference (I’m the pony), which I have just posted.

A couple of things I recall from the discussion (send corrections if you were there):

Glymour raised an important question/concern regarding adjustments for “cherry picking” that will be familiar to readers of this blog. There were several parts to his question, but in a nutshell, the issue is whether to adjust, and if so, how. If only one statistically significant effect is published, someone might not object, but discovering 19 or 99 other non-significant effects (concerning the same drug or other treatment, say), perhaps unreported, may alter the assessment of the first. Does this make sense? Should I really object, Glymour was asking? Answer: yes! Should I use a (conservative) Bonferroni adjustment or something else? Etc. My reply to someone who argues it’s too hard to take into account selection effects from a bunch of studies is this:

(a) It’s easy to demonstrate that ignoring selection effects will make it incredibly easy to find “nominally” significant results due to chance, when in fact you should be curbing your enthusiasm.

(b) I don’t need to show how to arrive at an adjustment; the onus is on the researcher to show their reported error probabilities hold (approximately). If what they’ve done makes it too hard or impossible for statisticians to figure out how to adjust (as often happens), then all the results must be set aside or thrown out until subsequent data are available.
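Point (a) is easy to check by simulation. A minimal sketch with assumed numbers (20 searches over pure noise, one-sided z cutoff of 2.326 for level .01):

```python
# Minimal Monte Carlo sketch of "hunting with a shotgun": run 20 tests on
# pure noise and report whether ANY of them clears the 0.01 cutoff.
import random

random.seed(1)     # fixed seed so the sketch is reproducible
Z_CUTOFF = 2.326   # one-sided z cutoff for nominal level 0.01

def hunt(n_tests=20):
    """True if at least one of n_tests null z-statistics looks 'significant'."""
    return any(random.gauss(0.0, 1.0) > Z_CUTOFF for _ in range(n_tests))

n_sims = 20000
hit_rate = sum(hunt() for _ in range(n_sims)) / n_sims
print(round(hit_rate, 2))  # close to 1 - 0.99**20, about 0.18, not 0.01
```

With roughly 18% of pure-noise “hunts” producing a nominally significant hit, the reported .01 error probability plainly fails to hold, which is exactly the onus that (b) places on the researcher.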

This is the reply I recommend to anyone who cares about fraud and non-replicable results. This is NOT just because they announced a further delay of my flight just now…

There was a statistician, or someone who said they practiced statistics, in the audience who gave a helpful reply, and I meant to speak to him afterwards, but then did not see him. Anyone know who that was? I don’t mean Tom Kepler, who also gave an excellent reply.

And by the way, what was so funny about Taleb?

“These models work because they need only capture rather coarse properties of the phenomena being probed: the error probabilities assessed are approximately related to actual ones.”

Everything you say rests on this. But in real problems, this isn’t known, usually isn’t true, and is almost never needed. There’s no hope of progress until this is realized. We don’t need to go any further than the simplest statistical model to see all three points. Assume you’re trying to weigh a philosophy professor: use the model y_i = weight + e_i and assume the errors are IID N(0,sigma).

THIS ISN’T KNOWN: Even if error frequencies were found to be IID N(0,sigma) in the past there is absolutely no guarantee that’s true for the current or future data sets. This is always an assumption whose truth is unknown at the time the inference is made. If you knew e_i precisely enough to check this assumption, then you could calculate weight = y_i-e_i directly without any need for statistics at all.

USUALLY ISN’T TRUE: Statisticians almost never check that the data generating mechanism (i.e. the weight scale) has IID N(0,sigma) errors. On the rare occasions when such checks were made, they are usually found to be false, leading statisticians to lament the impossibly effective overuse of the NIID assumption. Nevertheless in violation of your philosophy, all statisticians continue to assume IID Normal whenever they feel like it.

ALMOST NEVER NEEDED: The condition that the assessed error frequencies be approximately related to actual ones is a sufficient condition, but it’s very far from being a necessary one. In reality all you need is that the actual errors in the data satisfy the “typicality condition” which someone (Spanos?) mentioned in a previous post. In this case the typicality condition reduces to the actual errors being a typical example, where there is at least a little cancellation.

For a weight scale with errors on the order of 1 lb (= sigma) or less, the errors could be e_1,…,e_10 = 1, .8, .6, .4, .2, -.2, -.4, -.6, -.8, -1. With these errors and weight = 130 lbs, you get a 95% confidence interval of (129.38 lb, 130.62 lb), which accurately informs the statistician of where the true weight lies. This worked out great even though the errors weren’t random, identically distributed, or normally distributed in any way whatsoever.
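This arithmetic can be reproduced directly (treating sigma = 1 lb as known and using a 95% z-interval, which appears to be what is intended):

```python
# Reproducing the worked example: ten measurements of a 130 lb "true
# weight" with the stated (non-random) errors, and a 95% z-interval
# computed as if sigma = 1 lb were known.
from math import sqrt

true_weight = 130.0
errors = [1, .8, .6, .4, .2, -.2, -.4, -.6, -.8, -1]
y = [true_weight + e for e in errors]   # the observed data

n = len(y)
mean_y = sum(y) / n                     # 130.0 (the errors cancel exactly)
half_width = 1.96 * 1.0 / sqrt(n)       # sigma = 1 lb assumed known

ci = (mean_y - half_width, mean_y + half_width)
print(round(ci[0], 2), round(ci[1], 2))  # 129.38 130.62
```

The half-width is 1.96/sqrt(10) ≈ 0.62 lb, and the interval does cover the true weight, despite the errors being a fixed, perfectly symmetric pattern.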

The thing which you base everything on absolutely isn’t required at all, which is why Gelman’s book recommends you not waste time checking normality assumptions in regression models.

I invite you to conduct this experiment in real life and see if the errors are really IID N(0,sigma), and then compute the CI (or equivalently the Bayesian credibility interval for a uniform prior) and see if it leads you astray as your philosophy predicts.

What’s actually happening in these problems is that you know the general region where the error vector must reside. You get this region from the observation that the individual errors are on the order of 1lb using that weighing device. This is the only information we actually have about the errors. Then you guess that there will be some cancellation of errors since almost every error vector in that region has “some” cancellation. The guess turns out to be fairly successful in practice because, again, almost everything that could have been true implies some cancellation. Even if the guess turns out to be wrong, you couldn’t have done any better with the information you actually had about the errors. The random IID Normality of frequencies assumptions aren’t used or needed.

JohnQ: Read this quickly in the airport, but I’m confused as to your allegations. I didn’t say we needed to know Normal IID and what we require is less than what you claim “you know”. I think there is confusion about error probabilities. Will reread tomorrow… unreliable connection.

Maybe a condensed version? If you assume (as an example) NIID in a problem, you think this means the error frequencies must be approximately NIID and success depends on whether this assumption is true or not.

In reality, the point of the NIID assumption is to define a region where the error vector must lie. The only thing needed is that the actual error vector present in the data lie in this region (essentially Spanos’ typicality condition). As long as that’s true, it’s irrelevant whether the histogram of those errors looks like a normal distribution or not. In real life it usually doesn’t.

This is why textbooks are starting to not recommend checking normality assumptions in regression models. See Gelman’s Regression and Multilevel/Hierarchical Models book, bottom of page 45.

It’s also why statisticians can get away with assuming normality and build lots of great models without ever checking whether things like “weight scales” actually have N(0,sig) error frequencies.

I meant: “This is why text books are starting to not recommend checking normality assumptions in regression models.”

Sorry for the typos.

“Nevertheless in violation of your philosophy, all statisticians continue to assume IID Normal whenever they feel like it.”

Not true. (I should know a thing or two about this. ;-))

I think that the tricky issue here is that indeed it is not needed that the model really holds, but that *some* kinds of violations of the model will lead to trouble, some of which are detectable. So one should check some but not all aspects of the model assumptions (for example it’s usually (!) not a big problem that data in fact are discrete), but doing this one may still miss issues that may be relevant (irregular but strong dependence, for example). Though there may be no way around the latter problem anyway, so one may just say, fine, we live with that.

Hi Hennig,

Interesting points, but the issue I raised is logically prior to yours. There is a fundamental disagreement about what it means to say “the model holds”.

Mayo stated at the end of her slides (effectively – hopefully this honestly captures her meaning) that it’s correct to say a model like IID N(0,s) holds when the error frequencies look approximately normal, and look independent.

I stated that the NIID model holds when the error vector actually present in the data lies in the high probability area of the assumed multivariate normal.

I then made three claims:

1. We never know whether “the model holds” in Mayo’s sense when the inference is made, but we often do know that the “the model holds” in my sense. The reasons were given in my first comment.

2. In practice, the vast majority of the time someone assumes NIID, the model doesn’t hold in Mayo’s sense, but it does hold in my sense.

3. If the model fails to hold in Mayo’s sense, even if the failure is extreme, but does hold in mine, then everything works out fine. I gave an explicit example of this in my initial comment above.

Please note, these are empirical or mathematical claims.

JohnQ: Well, Mayo needs to explain herself how what is in her slides relates to what you make of it; I haven’t seen her claiming that a model “holds” in any case, only that it can be used to find out true things, which is a weaker statement.

You however apparently say that a model “holds” under very weak conditions. This doesn’t make much sense to me, because the Niid assumption is actually an assumption about a process, not about a single data set, and it is an essential part of the assumption that things happen from time to time in low but nonzero probability regions, too. So I’d indeed subscribe to a use of the term “the model holds” that makes it unobservable whether the model holds, which is in agreement with statistical theory (if whatever happens has nonzero probability under two conflicting models, there is no safe way to tell them apart and all we can use is error probabilities).

“2. In practice, the vast majority of the time someone assumes NIID, the model doesn’t hold in Mayo’s sense, but it does hold in my sense.”

It depends on how you choose your high probability area. Actually the set of vectors of irrational numbers has probability 1 under Niid, but such vectors are never observed… which illustrates that we shouldn’t really discuss about whether such models “hold” in a real situation, but rather only about whether we can make good use of them.

Hennig,

“I haven’t seen her claiming that a model “holds” in any case”

I was basing Mayo’s position on her statement on the last slide, using some poetic license: “These models work because they need only capture rather coarse properties of the phenomena being probed: the error probabilities assessed are approximately related to actual ones.”

“because the Niid assumption is actually an assumption about a process, not about a single data set”

My entire point was that this viewpoint isn’t correct either in theory or in practice. It represents a sufficient condition, but not a necessary one. In practice this sufficient condition is not known, not true, and not needed, because there are much weaker necessary conditions which are known and are true.

“which is in agreement with statistical theory”

It’s only in agreement with frequentist statistical theory. My version is in agreement with Bayesian theory, which views the distribution as defining the uncertainty, or a reasonable region, for the one set of errors in the one set of data that actually exists. Everything depends on how well we know the one set of errors in the data. If even one error is known exactly, then the weight = y_i - e_i is known exactly, and it’s irrelevant what all past, present, and future errors were. The distribution of errors that don’t exist couldn’t be more irrelevant.

“It depends on how you choose your high probability area”

Yes it does. In the example I gave, you know the errors e_i are on the order of magnitude of 1 lb or less. So the vector of errors in the data will be in the high probability area of an IID N(0, 1 lb). That is the essence of how actual true information about the errors is translated into a probability distribution for the errors.

“Actually the set of vectors of irrational numbers has probability 1 under Niid”

Measure theory considerations are irrelevant to the foundations of statistics since we can always replace a continuous distribution with a more realistic discrete one.

“which illustrates that we shouldn’t really discuss about whether such models “hold” in a real situation, but rather only about whether we can make good use of them”

I absolutely disagree. The two different notions of what it means for a model to hold in principle have a huge effect on how statistical modeling is done in practice. In particular, if a Bayesian notion of probability is the appropriate way to approach real errors in a repeated experiment, then when would you ever need Error Statistics?

JohnQ: Do you agree that with your weak notion of a model to “hold”, two quite different models leading to different conclusions can hold at the same time as long as only the observed data are in a high probability region of them both?

Nor, as far as I understand, does Bayes allow different models (for example iid N(0,1) and iid N(0,0.999)) to hold at the same time!?

It is possible for two quite different models to hold at the same time, and this is absolutely necessary in Bayesian statistics. In Bayesian statistics, probability distributions P(errors|knowledge) are conditional on a state of “knowledge”. If greater knowledge is used then the high probability region will be smaller. With greater knowledge there will be less uncertainty as to the actual errors in the data.

In the extreme case, one person could use IID N(0,1) for the errors, while another could know the errors exactly and use a Dirac delta function about the actual errors. In the example originally given, the first modeler would report the true weight is (129.38lb, 130.61lb) while the second would report [130,130]. Both are making correct statements since the true weight is 130lbs and is in both intervals, but the more informed modeler is naturally led to a smaller interval.

JohnQ: Here’s another interesting case in point. For a class in grad school I once measured the frequency distribution of time intervals between being passed on a certain highway while driving at the speed limit. It turns out to roughly follow an exponential distribution with a mean of around 30 seconds. On this information, my probability distribution for time to the next passing event is almost the same as the frequency distribution (sampling variability dominated estimation uncertainty in the posterior predictive distribution). But my probability distribution for the time to the next passing event would change a great deal if I just look in the rear-view mirror.

Agreed completely. I think the problem is that anyone who retains a substantial frequentist intuition, which includes most Bayesians, has an incredibly strong urge to think there can only be one correct model. This is a direct consequence of their freq=prob identification.

A real Bayesian thinking in terms of information can easily imagine a nested sequence of increasingly informative knowledge states K_1 < …< K_n each one of which is true, and hence leads to models which are saying correct things about the real world, but in which the final uncertainty resulting from using K_n is much lower than K_1.

The absurdity is that emphasis on getting the "correct" freq of the errors, which is irrelevant to the problem, causes us to lose focus on the one thing that is actually relevant; namely, what do we know about the errors that actually exist in the data?

The entropy of the IID N(0,sigma) distribution for n data points is something like n*ln(sigma). That is directly a measure of our knowledge about the true error vector, or equivalently, how well we know where the true error vector lies. Anyone with greater knowledge will be able to assign a distribution with smaller entropy. In the extreme case, if someone knows the error vector exactly, they will assign a distribution with entropy=0.
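For what it’s worth, the exact figure is n*(ln(sigma) + 0.5*ln(2*pi*e)); the n*ln(sigma) term in the comment is the sigma-dependent part that tracks knowledge of the error scale. A quick sketch:

```python
# Differential entropy (in nats) of n IID N(0, sigma) random variables:
# n * (ln(sigma) + 0.5 * ln(2*pi*e)). A smaller sigma means a tighter
# high probability region for the error vector, hence lower entropy.
from math import log, pi, e

def niid_normal_entropy(n, sigma):
    """Differential entropy of n IID N(0, sigma) variables, in nats."""
    return n * (log(sigma) + 0.5 * log(2 * pi * e))

print(round(niid_normal_entropy(10, 1.0), 3))  # ten errors known to ~1 lb
print(round(niid_normal_entropy(10, 0.1), 3))  # known to ~0.1 lb: lower entropy
```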

Those smaller entropy distributions will be “wrong” according to anyone who believes in the identity freq=prob, but nevertheless they will lead to more accurate estimates for the weight. In the extreme case, where the one error vector that actually exists is known precisely (i.e. distribution where entropy=0), the weight will then be known precisely as well since weight = Y_i-e_i.

And in all of this it really makes no difference what the frequency of anything is.

So? Different ways of carving things up, different measurements, different questions,…hopefully with some interconnected checks on the measurements if this is science–What’s new? It was, by the way, a point by Laura Ruetsche at our Onto-Meth conference wrt philo of physics….Or Senn’s recent post on modeling velocity. It would help John Q to read some of my philosophy of science, so my view isn’t quite so often caricatured by you. I can send you EGEK if you give me an address, else try my chapters in E & I (2010).

Mayo, not sure what the “so?” was a response to, but I read your work all the time and take it very seriously. Otherwise I wouldn’t be writing long comments here. There was no desire to caricature anything (or to imply that philosophy professors weigh as much as 130 lbs!). I’m absolutely not interested in wasting everyone’s time that way.

Even before considering any ideas like “prior” or “severity” or anything else someone might think about, there is the question of what the NIID really is. There seems to be this belief among both frequentists and most Bayesians that it is roughly a statement about the (stable) histogram of errors thrown off by the weight scale, while priors have an entirely different type of justification.

What I’m saying is that even in this example, where we are dealing with real errors in a repeated measurement scenario, we should still think of the NIID the same way we do priors: namely, there is one true error vector in the data and we need a distribution which puts that vector in its high probability region.

This is the approach which is justified by what we know (we don’t know current or future freq of errors) and matches what we do in practice (no statistician ever verifies that a weight scale actually gives “independent” errors in the shape of N(0,sigma)).

JohnQ: That different distributions can hold in subjective Bayes for two different persons is a different issue. For subjective distributions there is no condition that the observed data must be in the high probability region of the distribution, so this seems unconnected to what you wrote before.

My understanding is that in subjective Bayesianism a model "holds" for a person if and only if it reflects the person's subjective beliefs (which means, in case the person came up with a prior before observing the data, that the person arrived at the distribution by Bayesian updating). I'm fine with that, but it's quite different from what you originally wrote. Still, no two different distributions hold for the same person in the same situation (unless you mean "a mixture of them").

Hennig:

I wasn’t talking about “Subjective Bayes”. I never even hinted at Subjective Bayes and don’t know much about it. I’m talking about Objective Bayes. I stand by my previous remarks.

If the true weight is 130lbs then various models might give interval estimates of:

(100,200)

(120,140)

(129,130)

(129.99,130.01)

Every one of the interval estimates correctly identifies a region where the true weight lies, and there is absolutely no reason not to consider them all "correct". Obviously the later estimates reflect more knowledge and carry less uncertainty, but that doesn't change the basic point.
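As a toy check of this point, here is a sketch of my own using the numbers from the comment above (endpoints treated as included, so the comment's open-interval notation becomes closed intervals):

```python
# The true weight and the four interval estimates from the comment above.
# Each interval contains the truth; they differ only in informativeness (width).
true_weight = 130.0
intervals = [(100.0, 200.0), (120.0, 140.0), (129.0, 130.0), (129.99, 130.01)]

for lo, hi in intervals:
    assert lo <= true_weight <= hi  # every estimate is "correct" in this sense
    print(f"({lo}, {hi}) width = {hi - lo:g}")
```

All four assertions pass; only the widths differ, which is the sense in which the narrower estimates are more informative but not "more correct".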

Some of us are interested in learning about the world, not in what different people claim to know, wish to know, really believe, aspire to, or the like. We avoid the erroneous shift from phenomena to epiphenomena (mentioned on p. 10 of my slides).

So by using the unverified and generally untrue assumption that real errors have a stable pattern, and that you know what that pattern is ahead of time, you are "learning about the real world"?

Yet if I use the true fact that the errors from a given device are on a given order of magnitude or less to show that the true vector of errors lies in a certain region, then I'm engaged in "what different people claim to know, wish to know, really believe"?

“We avoid the erroneous shift from phenomena to epiphenomena”

Rereading this, it occurs to me you think all this Bayesian talk about "states of knowledge" and the like is a superfluous addition on top of a frequentist foundation.

But I stress that I'm not merely giving a different conceptualization of the same basic facts. Rather, the Bayesian and Frequentist understandings of the example rely on, and imply, different facts about the real world. And when you look at those facts in the real world, they strongly favor the Bayesian chain of reasoning, even though the example was about errors in repeated measurement (which would seem to be home field for Error Statistics).

Christian: JohnQ isn't a subjective Bayesian — he's a Jaynesian, like me. This position doesn't fall easily into the subjective/objective dichotomy. It holds that different states of information are encoded by different probability distributions, somewhat like subjective Bayesianism. It also takes as axiomatic that if two different agents have the same state of information, then they must have the same probability distribution; this is not an axiom or a consequence of subjective Bayesianism. (On page 1 of Jay Kadane's free textbook "Principles of Uncertainty", he writes, "Before we begin, I emphasize that the answers you give to the questions I ask you about your uncertainty are yours alone, and need not be the same as what someone else would say, even someone with the same information as you have, and facing the same decisions.")

In short, the distinction between subjective Bayesianism and Jaynes’s views is precisely that Jaynesians strive to ensure that the uncertain variables take values in the high probability region of the distribution.
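One way to make "the uncertain variables take values in the high probability region" operational — a sketch of my own, not anything stated verbatim in the thread — is the familiar concentration fact that for n independent N(0,1) errors the sum of squares concentrates around n, so a crude membership check is whether it falls within a few standard deviations (sqrt(2n)) of n:

```python
import math
import random

random.seed(2)

def in_high_prob_region(errs, k=4.0):
    """Crude check that an error vector sits in the high probability
    (typical) region of an assumed NIID N(0, 1) model: for n such errors,
    sum(e_i**2) should lie within k standard deviations (sqrt(2n)) of n."""
    n = len(errs)
    ss = sum(e * e for e in errs)
    return abs(ss - n) <= k * math.sqrt(2 * n)

ok_errors = [random.gauss(0.0, 1.0) for _ in range(50)]   # plausible under the model
bad_errors = [random.gauss(0.0, 5.0) for _ in range(50)]  # far too spread out

print(in_high_prob_region(ok_errors), in_high_prob_region(bad_errors))
```

The first vector passes the check and the second fails it: the model "holds" for the first error vector in the high-probability-region sense, and not for the second.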

OK, I hadn’t found out by the postings before that JohnQ is a Jaynesian.

Anyway, my interpretation of Jaynes would still be that given a certain state of information, only one distribution is "correct", despite the fact that one can imagine other states of information that make other distributions correct; whereas I originally understood John as saying that *all distributions for which the data lie in a high probability region hold in any case*.

Ah, confirmed by Corey already! So just a misunderstanding probably.

Corey: And is this high probability region just more beliefs about the region? Or the region in which it actually arises (probably)? What is it a measure of?

Mayo, I really don’t understand where all this “beliefs” stuff comes from.

Knowing an upper bound for the order of magnitude of errors on a given weighing device is far easier to verify empirically and rests on far safer objective grounds than does some mythical future long range frequencies. In real life, the histogram of errors usually isn’t stable. Not even approximately. Let alone known ahead of time.

So why do you keep dismissing this real information that the Bayesian uses as mere "beliefs"? Not once in all that I've written above did I ever even mention "beliefs".

JohnQ: Getting back to your comment, I just wanted to note that you missed the point of my very clipped, abbreviated statement on my last slide (always a danger with slides): The relevant part is "These models work because they need only capture rather coarse properties of the phenomena being probed". This is as opposed to having to be literally true or complete in any sense. We ask a very rough and abstract question such as, "what's the relative frequency of a property in some group?" or whatever. Then all the statistical work is on the derived questions, and the error statistical link-ups between what is observed and a host of possible answers to the coarse statistical question. This, in turn, can permit indirect learning about the full-bodied original phenomenon. (e.g., recent blogs on the Higgs boson.)

Mayo,

I did get your point. We just have different ideas of what those coarse properties are. You think those coarse properties are a kind of roughly stable histogram of the errors. I'm claiming this is unknown, generally untrue, and not needed. Rather, the relevant "coarse property" is that if the true errors are in the high probability region of P(errors), and almost all of the error vectors in that region involve some cancellation, then simply predicting "some cancellation" will be a highly robust guess.

That is the essence of what the Bayesian is doing in this case. It doesn't require the histogram of the errors to be anything like the assumed distribution, even approximately. It merely requires that the errors actually existing lie in the high probability region of the assumed distribution. This is both generally true for a distribution sufficiently spread out and possible to know at the time of the inference.

If you disagree, then there are at least two consequences. First, Gelman's recommendation that Normality of residuals needn't be checked in regression models is wrong. And second, if a statistician assumes that the errors from a device like a weight scale should be IID N(0,sigma), then statisticians should actually verify whether that's true of the given weight scale. Currently they basically never do.
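A small simulation of JohnQ's claim as I read it (my own illustration; all numbers invented): the actual errors come from a very non-normal source (uniform), yet an analysis that merely assumes N(0, sigma) with sigma a generous upper bound still produces an interval covering the true weight, because the true error vector lies well inside the assumed model's high probability region.

```python
import random
import statistics

random.seed(0)

TRUE_WEIGHT = 130.0  # the one true quantity being measured
SIGMA = 2.0          # assumed error scale: a generous upper bound, not a verified histogram
N = 25               # number of repeated weighings

# The actual errors are uniform on [-1, 1] -- nothing like N(0, SIGMA),
# but they sit comfortably inside its high probability region.
errors = [random.uniform(-1.0, 1.0) for _ in range(N)]
data = [TRUE_WEIGHT + e for e in errors]

# Under the assumed NIID N(0, SIGMA) model with a flat prior on the weight,
# the posterior for the weight is N(mean(data), SIGMA / sqrt(N)).
post_mean = statistics.mean(data)
post_sd = SIGMA / N ** 0.5
lo, hi = post_mean - 2 * post_sd, post_mean + 2 * post_sd

print(f"interval: ({lo:.2f}, {hi:.2f}), covers truth: {lo <= TRUE_WEIGHT <= hi}")
```

The interval covers the truth even though the assumed error histogram is wrong; whether one finds this a vindication of the high-probability-region view or an argument for actually checking the error model is exactly the disagreement in this thread.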

JohnQ: You are still very far from the point of my claim which had to do with the relationship between substantive and statistical information. Never made the claim about the histogram. I am also not mainly interested in prediction, or in predicting P(errors), or ‘some cancellation’ of errors, whatever these notions can mean.

I'm not interested in prediction either; I never mentioned it. We can't get to those other issues because there is a fundamental disagreement about what we're really doing when we use a probability distribution. There's no point in adding another floor to one's house if the foundation is built on mud.

“Never made the claim about the histogram”

Are you saying that when a frequentist assumes a probability distribution P(x), they don't mean that the frequency histogram of the x_i's looks approximately like P(x)?

JohnQ and Corey: Can you explain the following to me?

If it is a necessary and sufficient (?) condition for a model to “hold” that the observed data are in a (more or less suitably chosen) high probability region of the distribution, how can data ever be observed that are in low probability regions? If they cannot, how can it be justified that the probability of low probability regions is not actually zero?

(If you say that a N(0,1)-model can actually not hold for an observation with absolute value of larger than, say, 6, why wouldn’t the N(0,1)-distribution truncated at +/-6 be the more “correct” model? Are you asking us to not bother about anything in the distribution that happens in low probability regions?)

It may be that I got your whole point wrong but if so, please explain!

Hennig: Great question, but it’s difficult to explain. There’s really no good way to explain this point in a blog comment so please bear with me. Also note it’s the errors that we are giving a probability distribution to and not the data directly.

In principle if you knew the absolute value of the errors was smaller than 6 then you could truncate N(0,1) at +/- 6. Mathematically it would be a lot more inconvenient and would yield essentially the same result. The difference would be something like estimating (120,140) vs (120.01, 139.99) when the true answer is 130.

In general, though, the goal is to take a kind of "majority vote". If for every way you can get "no error cancellation" there are X >> 1 ways to get "some error cancellation", then assuming some error cancellation is both the best guess that can be made and one that will tend to be highly robust in practice.
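A crude demonstration of this "majority vote" (my own sketch, with made-up numbers): treat each possible vector of twenty ±1 errors as one way things could be, and count how often the mean error is much smaller than the individual errors, i.e. how often there is "some cancellation".

```python
import random

random.seed(1)

N = 20            # errors per error vector
TRIALS = 10_000   # error vectors sampled uniformly from the 2**N possibilities

def some_cancellation(errs, threshold=0.5):
    # "Some cancellation": the mean error is much smaller than the
    # individual error magnitude (which is always 1 here).
    return abs(sum(errs) / len(errs)) < threshold

count = sum(
    some_cancellation([random.choice((-1.0, 1.0)) for _ in range(N)])
    for _ in range(TRIALS)
)
print(f"fraction of error vectors with some cancellation: {count / TRIALS:.3f}")
```

The vast majority of the 2^20 equally counted error vectors show cancellation, which is the sense in which predicting "some cancellation" is a robust majority vote; the minority of non-cancelling vectors is exactly the "low probability event" discussed below.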

The subtlety comes from considering: over what hypothesis space do we take the vote?

In reality each error is a function of the micro-state Z of the universe. You can easily imagine a detailed physical model for Z which depends on a multitude of parameters and states of various atoms in the measuring device. The number of possible micro-states is typically massive; usually something crazy like 10^10^100. Each error is conceived of as a (highly many-to-one) function of one of these states e_i = F(z_i).

The NIID model is really a stand-in for that deeper hypothesis space. And it's over that deeper space that we are really "voting". Usually you don't consider Z directly or try to model it, although sometimes this is done. For example, early on Einstein modeled "Z" directly for polymer chains and used it to get the thermodynamics of folding polymers. It's much more common to never mention Z directly and just work with P(errors). In essence, P(errors) is a substitute for Z, and by computing expected values/probabilities over P(errors) you are really taking that "majority vote" over the space Z. This point is key.

There is no guarantee this "majority vote" will be right, though. Just because almost every state in Z leads to "some cancellation of errors" doesn't mean this is true for every state. And if in reality you're in one of those minority states where there is no cancellation of errors, you'll get what looks like a low probability event (according to P(errors)).

Since in practice we’re not counting states in Z directly, we have to get at P(errors) using whatever we do know. Assuming N(0,sigma) is saying “it’s possible to get a larger error but most of the possibilities in Z give errors closer to the origin”.

So the difference between the Bayesian and the Frequentist is this. The Bayesian claims nothing more than "most possibilities in Z give some cancellation of errors". The Frequentist claims "each possibility in the space Z of size 10^10^100 will occur equally often if you take measurements long enough". Frequentists believe the latter is a hard fact, easily verified, which leads to objectively true statements about the real world, while the former is pie-in-the-sky subjective wishful thinking. I claim the exact opposite is true.

JohnQ: Thanks for the effort! I have to admit that you’re probably right that this is too complex for a blog comment, so I feel that I miss a number of points in this explanation, which would probably need quite some effort to fill. Can you point me, for example, to something in Jaynes’s “Probability Theory” book where this “modelling from high probability regions”-aspect is explained?

Some comments:

1) The term “error” implies that a proper distinction can be made between a “true” value which you apparently think of as fixed and unobservable and some random “annoying distortion” of the truth. Which to me seems like a conception that is quite close to an objective frequentist one. (I’m somewhat wary of the term “error term” in frequentism; this suggests, in many applications wrongly, that a value of 0 would be “better”, in some sense, than others.)

2) Actually neglecting low probability regions in modelling can have serious consequences. For example, I may prefer a t-distribution to the Normal to allow for more extreme outliers even before I have seen one, and this may change quite a bit about later analyses, particularly if gross outliers turn up later, despite “high probability regions” being roughly the same.

3) Last paragraph: With my frequentist hat on (I’m a pluralist), I wouldn’t sign up to what you write there about frequentists at all. I’d say that a frequentist model is a (potentially useful) thought construct, not itself a hard fact, and that it can be checked in a falsificationist manner, without ever claiming that it is verified. I think many if not most frequentists would agree.

Christian: On #1, Jaynes's section on the physics of coin tossing is a concrete instantiation of the kind of argument JohnQ is making more abstractly above.

Whoops, not your #1 — on the question immediately preceding #1.

Thanks! Much appreciated!