I’m not there. (Several people have asked, I guess because I blogged JSM13.) If you hear of talks (or anecdotes) of interest to errorstatistics.com, please comment here (or on Twitter: @learnfromerror).


Stigler gave the presidential address: The 7 pillars of statistical wisdom:

For some posts on his talk:

http://www.statsblogs.com/2014/08/05/the-7-pillars-of-statistical-wisdom/

http://blogs.sas.com/content/iml/

I am at the JSM currently. It’s my first one and I am stunned by how big it is. You can easily miss people you’d like to meet over several days. Also, I am missing out all the time on things that I’d find interesting, and what I see is only a tiny, very biased sample of what goes on.

Here are a number of things that I saw that somehow relate to topics treated in this blog. On Monday I went to a session on reproducibility (there will be another one on Wednesday). Two of the four presentations were about software helping researchers to be reproducible and reliable, which is useful but doesn’t really have philosophical implications. The third presentation was by Jyoti Rayamajhi on strategies to improve the reliability of observational studies, the fourth one by Elizabeth Iorns on “the Reproducibility Initiative”. Both said many good things that people should have in mind (in observational studies one should already think about potential sources of bias and confounding a priori; Iorns complained bitterly about the lack of incentives for replication of studies), although not many of the thoughts were strikingly original. Iorns, together with collaborators, is currently trying to reproduce some studies connected with cancer and to work toward the availability of better resources for reproduction. She also had a list of requirements for quality replication, and cited some shocking results about how few studies in medicine actually reproduce their results if they are replicated at all (on Tuesday I saw similarly sobering statistics about the use of “not totally trivial” methods for dealing with missing values in published work in top medical journals).

The next session was a JASA discussion presentation by Bradley Efron, who presented a result and a scheme to run bootstrap so that the variation estimated by bootstrap takes formally well defined model selection into account for estimating prediction uncertainty (actually averaging bootstrap estimators, even from different selected models, improves matters). Discussants presented some alternative approaches for doing this. I think that Hendry claimed this feature for his methods as well, but his work didn’t come up in the discussion. I think that currently the scope of such work is still fairly limited.
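To make the idea of averaging bootstrap estimators across selected models concrete, here is a minimal sketch. It is not Efron’s actual scheme or his accuracy formula, neither of which is spelled out above. The selection rule (keep a slope only if it cuts the residual sum of squares by more than 20%), the function names, and the toy data are all assumptions made for illustration.

```python
import random
import statistics

def fit_and_predict(xs, ys, x0):
    """Select between a constant and a straight-line model, then predict at x0."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    rss_const = sum((y - mean_y) ** 2 for y in ys)
    rss_line = sum((y - mean_y - slope * (x - mean_x)) ** 2 for x, y in zip(xs, ys))
    # Toy selection rule (an assumption): keep the slope only if it
    # reduces the residual sum of squares by more than 20%.
    use_line = rss_line < 0.8 * rss_const
    return mean_y + slope * (x0 - mean_x) if use_line else mean_y

def smoothed_bootstrap_prediction(xs, ys, x0, reps=500, seed=0):
    """Average the select-then-predict result over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        preds.append(fit_and_predict([xs[i] for i in idx], [ys[i] for i in idx], x0))
    # The mean is the averaged (bagged) prediction; the standard deviation is the
    # plain bootstrap spread, which includes variation due to model selection.
    return statistics.fmean(preds), statistics.stdev(preds)

rng = random.Random(1)
xs = [i / 2 for i in range(30)]
ys = [1.0 + 0.3 * x + rng.gauss(0.0, 1.0) for x in xs]  # hypothetical data
estimate, spread = smoothed_bootstrap_prediction(xs, ys, x0=10.0)
print(f"bagged prediction {estimate:.2f}, bootstrap sd {spread:.2f}")
```

Because the model is re-selected inside every resample, the spread of the bootstrap predictions reflects selection uncertainty as well as estimation uncertainty, which a standard error computed after a single selection would ignore.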

Later in the day Stigler presented his “seven pillars of statistical wisdom”, which interestingly only implicitly refer to probability for modelling uncertainty. This kind of presentation is certainly interesting for thinking about the nature (and unity) of the field; not sure whether there were impulses for future work and developments in it, though.

A highlight for me on Tuesday was a session on Distributional Inference, covering modern use of fiducial distributions, confidence distributions and inferential models. The clearest of these presentations for me was the one by Min-ge Xie, who defined confidence distributions as distributions on the parameter space that basically express all information about possible confidence sets. All probabilities occurring in these distributions have a straightforward frequentist interpretation, and therefore these are legitimate frequentist devices for inference (although certainly somebody will misinterpret them like all the other devices we know). Modern fiducial inference has been revived in the last 5-10 years by Jan Hannig, who here acted as discussant and session organizer. In some situations his fiducial distributions are confidence distributions and therefore have a proper frequentist interpretation as well, but in others they are not. Hannig admitted that in general the meaning of these distributions may be a bit obscure, but advertised them as pragmatic means to solve certain problems (there was one presentation on regression model selection in massive datasets). There were also two presentations on Inferential Models by Chuanhai Liu and Ryan Martin, who have an impressive series of publications about this, some of which are quite foundational. Unfortunately these presentations went through the material too quickly, so that I could only catch a glimpse and don’t feel qualified to write too much about this. Liu and Martin as well as Hannig claim that these approaches for expressing the uncertainty about a parameter in the shape of a distribution, without the need to specify a prior, could unify frequentists and Bayesians (once more). This may be somewhat exaggerated, but this material is certainly worth looking at.
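As a small illustration of what a confidence distribution looks like, here is the standard textbook case of a normal mean with known standard deviation. It is not taken from the talk. The function name and the data are hypothetical. In this case the confidence distribution for the mean is simply N(x̄, σ²/n), its quantiles are confidence limits, and its cdf at a value θ0 is the one-sided p-value for testing θ ≤ θ0.

```python
import math
from statistics import NormalDist, fmean

def normal_mean_cd(data, sigma):
    """Confidence distribution for a normal mean with known sigma: N(xbar, sigma^2/n)."""
    n = len(data)
    return NormalDist(mu=fmean(data), sigma=sigma / math.sqrt(n))

data = [9.0, 10.0, 10.0, 11.0]          # hypothetical observations
cd = normal_mean_cd(data, sigma=2.0)    # sigma assumed known

# Quantiles of the confidence distribution are confidence limits:
lower, upper = cd.inv_cdf(0.025), cd.inv_cdf(0.975)   # two-sided 95% interval

# The cdf at theta0 is the p-value for H0: theta <= theta0 against theta > theta0.
one_sided_p = cd.cdf(9.0)

print(f"95% interval: ({lower:.3f}, {upper:.3f}); p-value at 9.0: {one_sided_p:.4f}")
```

Every number read off this distribution is a statement about the coverage or error rate of a procedure, which is why no prior is needed.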

Later on Tuesday there was the Deming lecture by Sharon Lohr on connecting some of Deming’s ideas with research on the quality of education. I liked a lot that the statistician Deming had warned decision makers so clearly about the dangers of ranking and setting quantitative targets, a very good illustration that statistical insight may actually prevent us from giving numbers too much authority.

Christian: Thanks so much for this! I’ll come back to comment later. It’s interesting that 3 of the people you mention on confidence distributions and such were discussants on my Birnbaum paper:

Martin and Liu, and Hannig

https://errorstatistics.files.wordpress.com/2014/08/martinsts1312-016ra0.pdf

https://errorstatistics.files.wordpress.com/2014/08/hannsts1401-003ra0.pdf

I was glad to see, judging from this person’s blog, that Stigler focussed on some topics that appear (to me) to be underplayed these days, especially the “pillars” of data reduction, experimental design, and model testing via residuals. http://blogs.sas.com/content/iml/2014/08/05/stiglers-seven-pillars-of-statistical-wisdom/

Measuring “uncertainty” (something Christian brought up) does occur in #3. I wasn’t there but this looks to be a pretty complete description. Please write if you have other impressions and info.

1. Aggregation: It sounds like an oxymoron that you can gain knowledge by discarding information, yet that is what happens when you replace a long list of numbers by a sum or mean. Every day the news media reports a summary of billions of stock market transactions by reporting a single weighted average of stock prices: the Dow Jones Industrial Average. Statisticians aggregate, and policy makers and business leaders use these aggregated values to make complex decisions.

2. The law of diminishing information: If 10 pieces of data are good, are 20 pieces twice as good? No, the value of additional information diminishes like the square root of the number of observations, which is why Stigler nicknamed this pillar the “root n rule.” The square root appears in formulas such as the standard error of the mean, which describes the probability that the mean of a sample will be close to the mean of a population.

3. Likelihood: Some people say that statistics is “the science of uncertainty.” One of the pillars of statistics is being able to confidently state how good a statistical estimate is. Hypothesis tests and p-values are examples of how statisticians use probability to carry out statistical inference.

4. Intercomparisons: When analyzing data, statisticians usually make comparisons that are based on differences among the data. This is different than in some fields, where comparisons are made against some ideal “gold standard.” Well-known analyses such as ANOVA and t-tests utilize this pillar.

5. Regression and multivariate analysis: Children that are born to two extraordinarily tall parents tend to be shorter than their parents. Similarly, if both parents are shorter than average, the children tend to be taller than the parents. This is known as regression to the mean. Regression is the best known example of multivariate analysis, which also includes dimension-reduction techniques and latent factor models.

6. Design: R. A. Fisher, in his Presidential Address to the First Indian Statistical Congress (1938), said “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” A pillar of statistics is the design of experiments and, by extension, all data collection and planning that leads to good data. Included in this pillar is the idea that random assignment of subjects to design cells improves the analysis. This pillar is the basis for agricultural experiments and clinical trials, just to name two examples.

7. Models and Residuals: This pillar enables you to examine shortcomings of a model by examining the difference between the observed data and the model. If the residuals have a systematic pattern, you can revise your model to explain the data better. You can continue this process until the residuals show no pattern. This pillar is used by statistical practitioners every time that they look at a diagnostic residual plot for a regression model.
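The “root n rule” in pillar 2 above can be checked with a short simulation. This is a minimal sketch. The function names, the normal population, and the sample sizes are assumptions chosen for illustration. The standard error of the mean is σ/√n, so doubling the sample size shrinks it by a factor of √2, not 2.

```python
import math
import random
import statistics

def standard_error(sigma: float, n: int) -> float:
    """Theoretical standard error of the mean of n independent observations."""
    return sigma / math.sqrt(n)

def simulated_se(sigma: float, n: int, reps: int = 2000, seed: int = 0) -> float:
    """Empirical standard deviation of the sample mean over many simulated samples."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

for n in (10, 20, 40):
    print(n, round(standard_error(1.0, n), 3), round(simulated_se(1.0, n), 3))

# Going from 10 to 20 observations improves precision by sqrt(2), not by 2.
gain = standard_error(1.0, 10) / standard_error(1.0, 20)
```

To halve the standard error you need four times as many observations, which is the sense in which additional data has diminishing value.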

Please see comment from Norm Matloff http://errorstatistics.com/2014/08/06/what-did-nate-silver-just-say-blogging-the-jsm-2013/#comment-89656

Before I write some more, I should probably say (because I didn’t comment much on Stigler’s presentation before) that I think that his choice of pillars is quite original. He tried to get at some unifying but not at-first-sight-obvious principles (I mean, most are kind of obvious when you see them, but I think that if other statisticians had been asked to compile their lists, Stigler’s would have seemed to be somewhat outlying). So as a perspective this is very valuable, regardless of whether we agree with it or not.

Main impression of Wednesday: The reproducibility crisis in science is now a quite fashionable topic in statistics, probably because statisticians have their hands in both what is seen to be the problem and what people (mostly statisticians;-)) think could solve it. Andreas Buja started his presentation about taking into account model selection for later inference with a review of reproducibility problems and singled out selection effects as a major reason. Actually his topic was pretty much the same as the topic of Efron the day before, but instead of fancy resampling, Buja went for good old adjustment for multiple testing, evaluating the number of models that would have been potentially possible. This seems very conservative and not very mathematically appealing, but is quite flexible and also honest (if it is applied in an honest way). In the same session Christian Robert talked about ABC (approximate Bayesian computation), using classification methods such as random forests to help with Bayesian model selection, to the effect that there is no full posterior at the end. Like it or not, this seemed quite creative and original to me, and actually viewing model selection as a classification problem was a new insight for me (which doesn’t necessarily have to be Bayesian; probably this idea could be used from a frequentist perspective, too).
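To see why adjusting for the number of potentially possible models is so conservative, here is a rough sketch that uses a plain Bonferroni correction. Bonferroni is a stand-in chosen for illustration; the comment above does not describe Buja’s actual procedure. The function names and the choice of 10 candidate predictors are assumptions.

```python
from math import comb

def n_selection_tests(p: int) -> int:
    """Count (submodel, coefficient) pairs over all non-empty subsets of p predictors."""
    return sum(comb(p, k) * k for k in range(1, p + 1))

def bonferroni_level(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

p = 10                                   # hypothetical number of candidate predictors
m = n_selection_tests(p)                 # equals p * 2**(p - 1)
adjusted = bonferroni_level(0.05, m)
print(f"{m} potential coefficient tests, per-test level {adjusted:.2e}")
```

With only 10 candidate predictors there are already 5120 coefficient tests that selection could have led to, so each one has to be run at a level of roughly 0.00001 to keep the overall error rate at 5%.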

Then there was a very well attended “late breaking session” with more on reproducibility. Yoav Benjamini’s title was “It’s not the p-values’ fault”; he basically commented on all kinds of selection biases and problems, which cannot be solved by use of confidence intervals and Bayesian methods either. Marcia McNutt and Philip B. Stark discussed reproducibility problems in a wider framework. Apart from issues with selection and the choice of statistical methodology, there are all kinds of further issues like lack of transparency and of proper protocols, uncritical reference to “authorities”, issues with design and selecting samples, all kinds of errors starting from coding errors in the software used, and also a definition problem of what constitutes a reproduction. Not only do results fail to reproduce when tried; sometimes it also turns out that studies cannot be reproduced because of lack of information or inconsistent information. Also, people should try to replicate studies not only precisely but also approximately, to test the robustness of results.

Actually, although I was very skeptical about the methodology and arguments in Ioannidis’ famous paper “Why Most Published Research Findings Are False”, I can’t help feeling that the amount of things that go wrong in science is indeed very, very worrying, and the vast majority of this doesn’t have to do with frequentist vs. Bayesian methods at all.

In the afternoon I had my own discussion of the session on classification and clustering; the presentations I was meant to discuss pushed me more in the direction of commenting on foundations of statistics than I would have thought. I discussed the use of the term “confidence” for a fancy resampling method for “testing” multimodality of a density by Werner Stuetzle, and I also commented on two Bayesian presentations that used data-dependent priors, so that nobody could really understand what the posteriors mean. Many Bayesians these days seem to be quite unimpressed by the lack of meaning. Actually, I finally agreed with Russell Steele (one of the “culprits”) that clustering needs a lot of tuning decisions because researchers in different applications need the resulting clusters to have different properties, and that Bayesian priors could well be used for such tuning, except that this is not how the Bayesians sell them.

Afterwards, Grace Wahba gave the Fisher-prize presentation and apologized for treating the audience to quite heavy mathematics on spline- and reproducing kernel Hilbert-space based generalised “Analysis of Variance”-decompositions; quite inspiring though too heavy to really get the details without going into the literature.

Another highlight for me today (Thursday) was a session giving an overview of statistical problems in astronomy. Interestingly, in these two presentations by Chad Schafer and David van Dyk, another foundational discussion came up, sparked by me asking innocently ;-) about what kind of information their priors represent. Actually I think that the Bayesian approach is indeed quite suitable for a number of problems that they have, particularly for combining all kinds of background information, for example distributions of certain measurements over lots of celestial objects when doing data analysis for a single one, and for expressing the uncertainty about certain parameters of interest that are very indirectly related to what can be observed, with quite a number of sources of uncertainty in between. This work is not so much about testing/confirming simple research hypotheses, but rather about modelling how all kinds of influences and knowledge can be put together to get at values of general cosmological parameters such as density of dark matter, or on the other hand at more detailed knowledge and predictive ability about specific objects such as the sun. The Bayesian posteriors are used to quantify uncertainty, which seems problematic to the extent that the priors are only partly based on solid information and other parts are just chosen for convenience. However, it is unclear to me how much of this could be done as well with a frequentist analysis, still representing the existing background knowledge properly.

Christian: Thanks so much for such a comprehensive review. I have to study it more carefully. I’m very glad to hear that people, some of them, are focusing on selection effects and biases as responsible for non-reproducibility rather than simply scapegoating p-values (followed by a formal ‘cure’ worse than the disease).

I’d like to see Benjamini’s presentation.

The prevalence of errors that you worry about may not have to do with frequentist vs Bayesian, but many of the reactions and causes do–among other things.

See for example my slides on this post:

http://errorstatistics.com/phil6334-s14-mayo-and-spanos/phil-6334-slides/phil-6334-day-12-slides/