WHIPPING BOYS AND WITCH HUNTERS

Posted on September 26, 2011 by Mayo

In an earlier post I alleged that frequentist hypotheses tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways). Checking the history of this term, however, I find a certain disanalogy with at least the original meaning of “whipping boy,” namely, an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline. It was thought that seeing an innocent companion, often a friend, beaten for his own transgressions would supply an effective way to ensure the prince would not repeat the same mistake. But significance test floggings, rather than serving as a tool for humbled self-improvement and commitment to avoiding flagrant rule violations, have tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to that of witch hunting, which in some places became an occupation in its own right.

Now some early literature, e.g., Morrison and Henkel’s Significance Test Controversy (1970), performed an important service over forty years ago. It alerted social scientists to the fallacies of significance tests: mistaking a statistically significant difference for one of substantive importance, interpreting insignificant results as evidence for the null hypothesis (especially problematic with insensitive tests), and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, contributors call attention to a cluster of pitfalls and fallacies of testing.

The volume describes research studies conducted for the sole purpose of revealing these flaws. Rosenthal and Gaito (1963) document how it is not rare for scientists to mistakenly regard a statistically significant difference, say at level .05, as indicating a greater discrepancy from the null when arising from a large sample size rather than a smaller sample size, even though a correct interpretation of tests indicates the reverse. By and large, these critics are not espousing a Bayesian line but rather see themselves as offering “reforms,” e.g., supplementing simple significance tests with power (e.g., Jacob Cohen’s “power analytic movement”), and most especially, replacing tests with confidence interval estimates of the size of discrepancy (from the null) indicated by the data. (Of course, the use of power is central for (frequentist) Neyman-Pearson tests, and (frequentist) confidence interval estimation even has a duality with hypothesis tests!)
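To see why the reverse holds, a small sketch may help. It assumes a one-sided z-test of a null of zero mean with known standard deviation, an assumption made here purely for illustration and not the setup of Rosenthal and Gaito's study; the function name and the sample sizes are likewise hypothetical. It computes the smallest observed mean that just reaches significance at level .05:

```python
from math import sqrt
from statistics import NormalDist


def minimal_significant_mean(n, sigma=1.0, alpha=0.05):
    """Smallest sample mean that rejects H0: mu = 0 in a one-sided z-test.

    Illustrative assumption: known sigma, normal sampling distribution.
    """
    # Critical value of the standard normal for the chosen level.
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    # The standard error of the mean is sigma / sqrt(n).
    return z_alpha * sigma / sqrt(n)


# Hypothetical sample sizes, chosen only to show the trend.
for n in (25, 100, 10_000):
    print(n, round(minimal_significant_mean(n), 4))
```

The cutoff shrinks in proportion to 1/√n, so a result that just reaches the .05 level with n = 10,000 corresponds to a far smaller observed difference than one that just reaches it with n = 25.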

But rather than take a temporary job of pointing up some understandable fallacies in the use of newly adopted statistical tools by social scientific practitioners, or lead by example of right-headed statistical analyses, the New Reformers have seemed to settle into a permanent career of showing the same fallacies. Yes, they advocate “alternative” methods, e.g., “effect size” analysis, power analysis, confidence intervals, meta-analysis. But never having adequately unearthed the essential reasoning and rationale of significance tests—admittedly something that goes beyond many typical expositions—their supplements and reforms often betray the same confusions and pitfalls that underlie the methods they seek to supplement or replace! (I will give readers a chance to demonstrate this in later posts.)

We all reject the highly lampooned, recipe-like uses of significance tests; I and others insist on interpreting tests to reflect the extent of discrepancy indicated or not (e.g., Mayo 1981, writing my doctoral dissertation)! I never imagined that hypotheses tests (of all stripes) would continue to be flogged again and again, in the same ways!

Frustrated with the limited progress in psychology, critics blame an imagined malign conspiracy of significance tests: traditional reliance on statistical significance testing, we hear,

“has a debilitating effect on the general research effort to develop cumulative theoretical knowledge and understanding. However, it is also important to note that it destroys the usefulness of psychological research as a means for solving practical problems in society” (Schmidt 1996, 122)[i].

Lest enthusiasm for revisiting the same cluster of elementary fallacies of tests begin to lose steam, the threats of dangers posed become ever shriller: just as the witch is responsible for whatever ails a community, the significance tester is portrayed as so powerful as to be responsible for blocking scientific progress. In order to keep the gig alive, a certain level of breathless hysteria is common: “statistical significance is hurting people, indeed killing them” (Ziliak and McCloskey 2008, 186)[ii]; significance testers are members of a “cult” led by R.A. Fisher, whom they call “The Wasp”. To the question, “What if there were no Significance Tests,” as the title of one book inquires[iii], surely the implication is that once tests are extirpated, their research projects would bloom and thrive; so let us have Task Forces [iv] to keep reformers busy at journalistic reforms to banish the test once and for all!

Harlow, L., Mulaik, S., and Steiger, J. (eds.) (1997), What If There Were No Significance Tests? (pp. 37-64), Lawrence Erlbaum Associates, Mahwah, NJ.

Hunter, J.E. (1997), “Needed: A Ban on the Significance Test,” American Psychological Society 8:3-7.

Morrison, D. and Henkel, R. (eds.) (1970), The Significance Test Controversy, Aldine, Chicago.

MSERA (1998), Research in the Schools, 5(2) “Special Issue: Statistical Significance Testing,” Birmingham, Alabama.

Rosenthal, R. and Gaito, J. (1963), “The Interpretation of Levels of Significance by Psychological Researchers,” Journal of Psychology 55:33-38.

Ziliak, S. and McCloskey, D. (2008), The Cult of Statistical Significance, University of Michigan Press.

[i]Schmidt was the one Erich Lehmann wrote to me about, expressing great concern.

[ii] While setting themselves up as High Priest and Priestess of “reformers,” their own nostrums reveal they fall into the same fallacy pointed up by Rosenthal and Gaito (among many others) nearly half a century ago. That’s what should scare us!

[iii] In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.

[iv] MSERA (1998): ‘Special Issue: Statistical Significance Testing,’ Research in the Schools, 5. See also Hunter (1997). The last I heard, they have not succeeded in their attempt at an all-out “test ban”. Interested readers might check the status of the effort, and report back.

Categories: Statistics | Tags: reformers, significance test controversies, significance tests | 8 Comments

8 thoughts on “WHIPPING BOYS AND WITCH HUNTERS”

September 26, 2011

Stanley

As an editor of the book, What if there were no significance tests?, I also had a chapter in that book in which I sought to
defend significance testing. So, I think I’m in your camp.
But I may have a different position from you on what I
call ‘nil hypothesis testing’, which is what the controversy
is about in psychology, although many don’t see the
difference between ‘null hypothesis testing’ and ‘nil
hypothesis testing’. The null hypothesis is the hypothesis
about the parameter that the researcher provisionally
will accept unless data so improbable and so different
from it is observed as to strain the bonds of belief to
the point of breaking (Fisher’s notion). The nil hypothesis
is the hypothesis that the parameter tested is 0. You
can have non-zero values for a null hypothesis. The
nil hypothesis by definition is that it is 0.
The flap in psychology concerns always testing a nil
hypothesis as your null hypothesis. And the way
psychologists do this is that they think there is a
relationship, although their knowledge is so limited that
they cannot specify a nonzero value for the degree of
the relationship. So, they think that what they must do is
show that the idea that there is no relationship is wrong:
hence they test a nil hypothesis. But they do not go on
to realize that whatever theory they may hold, rejecting
the nil hypothesis will be consistent with any theory
of the relationship that holds for some form of relationship.
In other words, there may be a plethora of theories that
prescribe different nonzero values for the relationship,
and all of these are supported by rejecting the nil hypothesis.
The nil-hypothesis method does not lead to scientific
advance. As samples get larger and larger, more and
more evidence piles up that it is not nil. But no one goes
on from there in psychology (their mathematical skills
are limited for the most part) to develop theories that
specify nonzero values for the relationship. That’s what
physicists would do.
And that is what Paul Meehl back in the 1960’s was
pointing out in articles, some in Philosophy of Science,
about the difference between the way physicists test
hypotheses with statistics and the way psychologists do.
The physicists may go through an initial abductive
phase in which they try out several quantitative theories
about some data, until they get one that seems to fit that
data. Then they deduce from the theory that fit best
a value for the parameter and use that value as the
null hypothesis value for a statistical test in some
new setting with new data not used in the formulation of the
hypothesis. (Recall C. S Peirce’s triad of abduction,
deduction, and induction). They may even use an
estimated value of the parameter from prior studies
as a constant and test if the new data differs significantly
from it.
That’s what was done in 1919 when Eddington and
Crommelin took observations from two locations of an eclipse of the sun and tested Einstein’s theory of relativity. The theory
concerned how gravity of a massive object would bend the light coming from the star so that its apparent position relative
to other stars would be deflected from its position otherwise.
Without going too far into the details, Einstein’s theory
predicted a deflection for the position of a star of 1″.74 at the edge of the sun with the amount decreasing inversely as the
distance from the sun’s center. Newtonian theory predicted
a deflection of just one half that value, 0″.87 at the edge of
the sun. Since the position of the star relative to other
stars could be compared from times when the sun was not
in the field of view (at other times during the year as the
earth revolved around the sun), the deflections could be
computed. The data came out in support of Einstein’s
prediction as opposed to the Newtonian prediction.
So, here the null hypothesis was these values 1″.74 versus
0″.87. And 1″.74 was within the confidence interval of two standard errors. All of this is in my chapter in the reference cited
above for the book with editors Harlow, Mulaik, and Steiger.
Paradoxically psychologists think the problem is that
they should be using confidence intervals instead of point
estimates, that it is a question of power, sample size, etc.
But the problem is the nature of the hypotheses being tested.
The value tested should be some specific nonzero value
deduced from some theory appropriate to the data. Until
psychologists learn more mathematics and do as the
physicists do, they will make little progress.

As for Bayesian inference: I’m not a Bayesian, especially
not a subjective Bayesian. Statisticians have not read
Wittgenstein’s argument against the coherence of the
idea of a private language. But their idea of a subjective
prior suffers the same fate at the hands of this argument.
If a subjective prior is what it seems to say, it should be
private and unique to the statistician. But how does the
statistician know his own mind? If the prior is based on a
concept, the concept must be determined according to a rule,
and a rule, to be followed, has to have a way to determine
if you are or are not following the rule. But how can the private concept be checked to see if it is used properly? How can the statistician determine that what he thinks is right
is really right according to the rule? Is whatever he says is
right going to be right? In that case he is not following a
rule. The concept of a private concept is incoherent.
The concept of a subjective prior is incoherent.

If the statistician arrives at his prior by some concerted
discussion with others, that is different. Apparently they
are following some kind of rules in doing this, and they
would have to be following some set of data they have
already at hand to do this. In fact, the least problematic
concept of a prior probability would be an empirical
prior. After all, the prior should be about something in
the world, and there are routine determinations that we
make about what is objectively in the world, meaning
there are ways to know when we are right independently
of just saying we are right.

I’ve noticed however, that statisticians regard things like
the uniform prior as cases of objective priors. That is a
rational prior. While you may be able to apply it
objectively, since it is a mathematical concept, it
is not the same as an objective concept about something
in the world. But the prior should be about what is in
the world and represent some objective determination
about the world. That’s what an empirical prior is.
The problem is that there may not be data available by
which to compute an empirical prior.

I’m also not sure whether many philosophers of science
have read Wittgenstein’s private language argument.
My impression reading Philosophy of Science over the
years is that most departments of philosophy of science
are filled with faculty who were trained by empiricist,
phenomenalist professors, and their students are getting
variants of that. To me the emphasis on Bayesian inference
in articles by philosophers of science these days suggests
they are still under the sway of phenomenalist epistemology.
As are the statisticians, who have an excuse because they
are not philosophers.

Anyway, Deborah, I’ll be interested to see what you and
others do here.

Stan Mulaik

Reply

September 26, 2011

mayoerrorstat

Thanks for your comments Stan. I’d have too much to say to respond fully just now. I have discussed the general relativity tests case quite a lot. None of the early tests on eclipses sufficed to pass the Einstein prediction with any severity. That required radioastronomy later on.

Meehl was (rightly) pointing out that inferring a specific, substantive alternative H’, simply on the grounds that a null is rejected, fails to put H’ to a stringent test. I’m not sure why psychologists, as you say, are inclined not to specify alternatives. Anyway the same problem can happen with null hypotheses tests, inadequately interpreted.

Reply
December 27, 2012

Mayo

Stan: I’ve been going over some comments on my blog that I intended to come back to. I really like your comments, and amazingly, I agree with all of them. One thing though about subjective probability and W’s private language criticism. Can’t subjective Bayesians just say, sure, they can never be wrong about what they believe at the time? It’s just a matter of introspection, or perhaps they ask themselves how much they would bet on the truth of a proposition. Why must they concur that they are incoherent in applying whatever favorite betting strategy is out there to determine beliefs? I agree with E.S. Pearson, who argued from the start that he felt it would be utterly impossible to settle on degrees of belief, that he would change his mind repeatedly, and thus he was loath to put subjective priors at the foundation of statistics. I’ve never been too convinced by W’s arguments against private languages, but this is likely mostly a semantic issue (as W would want it to be).

Reply

September 26, 2011

Bill Jefferys

Mayo is exactly right about the Eddington (and later) tests of GR. Only radio astronomers had the necessary sensitivity to do it right.

Reply

December 27, 2012

Stanley Mulaik

I have little to add beyond what I said on p. 93 of What If There Were No Significance Tests? in my chapter with Raju and Harshman. I say that Eddington’s Type I error rate was higher than most current psychologists would accept. But his power was adequate for the small sample. But larger samples with more conventionally acceptable Type I error rates would satisfy more folks.

As for Wittgenstein’s private language argument applying to subjective probabilities, I still think I was right to apply this argument. Anything you say is right is going to be right. And that means we cannot say what is right here. (You need an external criterion to be able to say what is right). I’ve not pursued it further, but I think Bayesian statistics works from some numerical principle that would work with a whole range of numbers for the prior probabilities. So, you don’t have to be right in stating your prior probabilities. You just have to state some numbers within a certain interval, and they will all lead to useful results. That is speculation, however.

I’ve moved on to other issues. I got into the philosophy of causality in connection with structural equation modelling. See my chapter on causation in my Linear Causal Modeling with Structural Equations, published in 2009 by Chapman and Hall/CRC Press, Taylor and Francis Group. By drawing upon the works of George Lakoff, I’ve taken up a kind of cognitive science of science approach, which is quite compatible with a neo-Kantian view.
Lakoff and Johnson argue that all abstract thought is conducted with metaphors, many of which are not recognized because the Romans thought of them first and they come to us in Latin words. Metaphors function like the a priori categories of Kant to provide structure for thought. Have you seen Lakoff and Johnson’s Philosophy in the Flesh? That is the cognitive science of philosophy, revealing the metaphors underlying major schools of philosophy through history. I also was intrigued with Kant’s categories of judgment and after many years realized they were based on various stages of synthesis. You need to understand synthesis and analysis in Kant’s work. It is central. C. S. Peirce, to my delight, recognized this over 100 years ago, while I came to the same understanding, that synthesis occurs in threes, only 100 years later. First there is the distinct element. Only that is at this point the focus of your thought. This is a One. Then you have the thought of comparing that with some other One by showing them distinct, related or in contrast. That is a Two. Then you create a thought that consists of all those Twos in simultaneous interrelations, a synthesis of Twos. That is held in a single thought, a Three. So, you have inherence, the relation between object and attribute, a One. Then there is causation, how one attribute is conditioned or dependent upon another. That is a single thought. It is a Two.
And then there is the simultaneous causal interrelations between objects in community (we say today a system). A Three. You can make higher syntheses, but they all can be broken down into threes.

Right now my interests have turned to Modern Monetary Theory, and I think I can show that there is no national debt. It would be nice if the GOP/Tea Party, even President Obama, realized that. But I have been retired for 12 years, so I can indulge these interests. It is frustrating that we likely will go off the fiscal cliff, realizing there was absolutely no real reason to do so. We are at the mercy of the flotsam of a sea of ignorance.

I don’t know if I have anything new on significance tests to add beyond what I wrote in Harlow, Mulaik and Steiger (1997).

Reply

September 26, 2011

Corey

Where’s the beef?

Your agenda for this blog probably includes offering a positive argument in support of your philosophical stance at some point. You clearly have a plan of action; if there’s some flexibility in the structure of the argument you plan to make, my preference (not that you’ve asked, but what the heck) is to see that positive argument sooner rather than later.

Reply

September 26, 2011

mayoerrorstat

Please remember that I have included links to everything I’ve written giving the positive arguments. Perhaps check the previous blog entry with the lucky 13 criticisms. There is a link to Mayo and Spanos. A key goal for this blog is to go some places that published papers do not quite go, in order to link up directly to people’s misconceptions/questions, etc. Having said that, this is entirely an experimental activity without a rigid plan. Events and reactions will influence entries. Thanks much for your interest.

Reply

December 27, 2012

stanislaus2

Whatever you do, you will be guided by some prior way of structuring your thought. And you should be aware of what that is.

Reply

I welcome constructive comments that are of relevance to the post and the discussion, and discourage detours into irrelevant topics, however interesting, or unconstructive declarations that "you (or they) are just all wrong". If you want to correct or remove a comment, send me an e-mail. If readers have already replied to the comment, you may be asked to replace it to retain comprehension.

© Deborah G. Mayo, Error Statistics Philosophy, 2011-2018 All Rights Reserved.
