Monday, April 16, is Jerzy Neyman’s birthday, but this post is not about Neyman (that comes later, I hope). But in thinking of Neyman, I’m reminded of Erich Lehmann, Neyman’s first student, and a promissory note I gave in a post on September 15, 2011. I wrote:

“One day (in 1997), I received a bulging, six-page, handwritten letter from him in tiny, extremely neat scrawl (and many more after that). … I remember it contained two especially noteworthy pieces of information, one intriguing, the other quite surprising. The intriguing one (I’ll come back to the surprising one another time, if reminded) was this: He told me he was sitting in a very large room at an ASA meeting where they were shutting down the conference book display (or maybe they were setting it up), and on a very long, dark table sat just one book, all alone, shiny red. He said he wondered if it might be of interest to him! So he walked up to it…. It turned out to be my *Error and the Growth of Experimental Knowledge* (1996, Chicago), which he reviewed soon after.”

But what about the “surprising one” that I was to come back to “if reminded”? (Yes, one person did remind me last month.) The surprising one is that Lehmann’s letter (his first letter to me) asked me to please read a paper by Frank Schmidt, to appear in his wife Juliet Shaffer’s then-new journal, *Psychological Methods*, as he wondered if I had any ideas as to what might be done to answer such criticisms of frequentist tests! But clearly, few people could have been in a better position than Lehmann to “do something about” these arguments … hence my surprise. But I think he was reluctant….

Lehmann actually hand-wrote some of the quotes from Schmidt’s paper (no links!), such as the one in my September 15 blog post.

“Reliance on statistical significance testing … has a debilitating effect on the general research effort to develop cumulative theoretical knowledge and understanding. However, it is also important to note that it destroys the usefulness of psychological research as a means for solving practical problems in society” (Schmidt 1996, 122)[i].

Schmidt is one of the leaders of what I dubbed the “New Reformers”. Kevin Carlson of the Department of Management at Virginia Tech informs me that Dr. Schmidt will speak here today, Friday, April 13.

The main reason, in this quote, that Schmidt claims “reliance on significance testing retards the growth of cumulative research knowledge” (115) is that he finds that tests on questions of social policy (e.g., do government-sponsored job-training programs work?) often have very low power and yield conflicting results. I have no reason to question this. Whether it is a good idea to “reconcile” these studies, perhaps to find effects that were overlooked, using the meta-analytic techniques Schmidt recommends is unclear.
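A quick simulation can illustrate the phenomenon Schmidt describes. The effect size, sample size, and seed below are my own illustrative choices, not his:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative numbers (not Schmidt's): a real but modest effect
# (Cohen's d = 0.3) studied with only 20 subjects per group.
d, n, reps = 0.3, 20, 1000
rejections = 0
for _ in range(reps):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(d, 1.0, n)
    if stats.ttest_ind(treated, control).pvalue < 0.05:
        rejections += 1

# Only roughly 15% of such studies reach p < 0.05, so a literature of
# them looks "conflicting" even though every study probes a real effect.
print(f"{rejections / reps:.0%} of studies reject the null")
```

The point is not that any single study is wrong, but that at this power a mix of “significant” and “non-significant” verdicts is exactly what one should expect.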

Still, I agree with Schmidt when it comes to the importance of power considerations, and am glad that he and Cohen have long argued for power analysis. His remark that “one reason why psychologists for so long gave virtually no attention to the question of statistical power” may be traced to Fisher’s influence is likely correct: “The concept of statistical power does not exist in Fisherian statistics. In Fisherian statistics, the focus of attention is solely on the null hypothesis. No alternative hypothesis is introduced.” (122) Then again, we have seen the importance Fisher gave to sensitive tests.[ii]
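For readers who want the arithmetic behind a power analysis, here is a minimal sketch using a standard normal approximation for a two-sample comparison (the function name and the numbers are mine, for illustration):

```python
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test
    of a standardized effect size d, with equal group sizes."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5  # noncentrality parameter
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

print(round(approx_power(0.3, 20), 2))   # a small study: power ~ 0.16
print(round(approx_power(0.3, 175), 2))  # ~175 per group gives power ~ 0.80
```

The comparison makes Cohen’s and Schmidt’s point concrete: detecting a modest effect reliably takes nearly an order of magnitude more subjects than many small studies employ.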

Finally, I think the use of simple significance tests, properly interpreted, has a valid role to play; indeed, even Bayesians rely on variants of significance tests to scrutinize their models these days (e.g., Gelman, 2011 [iii]). Moreover, rather than needing to replace tests with corresponding confidence intervals, as many New Reformers advise, it seems to me that when significance tests are supplemented with assessments of the discrepancies that have and have not passed severe tests, they direct a more nuanced use and interpretation of confidence intervals.

I look forward to meeting and hearing Schmidt on “Are True Score and Construct Scores the Same?”

- Dr. Frank Schmidt, Pamplin 1045, 2:30 p.m., April 13, 2012, Virginia Tech

[i] Frank L. Schmidt (1996). “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers”, *Psychological Methods* 1(2): 115–129.

[ii] My own view is that Fisher’s complaints about power and the alternative hypothesis in “the triad” had more to do with personality conflicts that had erupted. Remember the 5-year plans, and Pearson’s “heresy”, and all that? http://errorstatistics.com/2012/02/11/fisher-silver-jubliee/

[iii] Andrew Gelman (2011). “Induction and Deduction in Bayesian Data Analysis”, in *Rationality, Markets and Morals (RMM): Studies at the Intersection of Philosophy and Economics* (M. Albert, H. Kliemt and B. Lahno, eds.), an open-access journal published by the Frankfurt School Verlag, Volume 2: 67–78.

Lehmann review of EGEK: http://www.phil.vt.edu/dmayo/pubs/EGEKLehmann_review.pdf

I didn’t have the time to read Schmidt’s paper in full, but I am a bit familiar with this discussion in psychology.

I think that this has more to do with the culture of teaching, using and communicating significance tests and their results than with the method itself.

Many psychologists don’t have a deep understanding of statistical modelling but statistical tests are taught and perceived as cornerstones of objectivity and so there is pressure to use them for whatever you do. I don’t know whether this still is the case but demonstrating statistical significance was for quite some time a publication criterion in several major psychological journals.

Therefore there is a lot of routine testing going on without properly looking at model assumptions, problems from multiple testing, and the like, and results are often misinterpreted (“we’re 95% sure that the null hypothesis is true”). One can therefore make a valid case that there is a problem with a culture that forces people into doing this regardless of whether it is appropriate. But that needn’t stop anyone from holding that tests can still be reasonable where it is properly understood what they can and cannot deliver.
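That particular misreading can be checked directly: when the null hypothesis is true, p-values are (approximately) uniformly distributed, so a large p is routine under H0 rather than evidence that H0 is “95% certain”. A minimal sketch, with my own illustrative setup:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Both samples come from the same distribution, so the null is true
# in every replication; the resulting p-values are roughly uniform.
pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(2000)
])

# About half the p-values exceed 0.5, and about 5% fall below 0.05:
# a non-significant p is the ordinary outcome under H0, not a
# quantified degree of belief in it.
print(f"fraction with p > 0.5: {(pvals > 0.5).mean():.2f}")
```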

I should add that I don’t want to blame psychologists specifically for misusing statistics. Significance testing seems to have some “cultural gravity”. Even my own students, when writing reports, tend to focus much too exclusively on what they get from significance tests rather than commenting on effect sizes etc., despite my continuing efforts, and long for “objective” p-values for whatever they have to make a decision about.