Statistics and ESP research (Diaconis)

In the early ‘80s, fresh out of graduate school, I persuaded Persi Diaconis, Jack Good, and Patrick Suppes to participate in a session I wanted to organize on ESP and statistics. It seems remarkable to me now, not only that they agreed to participate*, but the extent to which PSI research was taken seriously at the time. It wasn’t much later that all the recurring errors and loopholes, and the persistent cheating and self-delusion (despite earnest attempts to trigger and analyze the phenomena), would lead many, indeed nearly everyone, to label PSI research a “degenerating programme” (in the Popperian-Lakatosian sense).

(Though I’d have to check names and dates, I seem to recall that the last straw was when some of the Stanford researchers were found guilty of (unconscious) fraud. Jack Good continued to be interested in the area, but less so, I think. I do not know about the others.)

It is interesting to see how background information enters into inquiry here. So, even though it’s late on a Saturday night, here’s a snippet from one of the papers that caught my interest in graduate school: Diaconis’s (1978) “Statistical Problems in ESP Research”, in Science, along with some critical “letters”.

Summary. In search of repeatable ESP experiments, modern investigators are using more complex targets, richer and freer responses, feedback, and more naturalistic conditions. This makes tractable statistical models less applicable. Moreover, controls often are so loose that no valid statistical analysis is possible. Some common problems are multiple end points, subject cheating, and unconscious sensory cueing. Unfortunately, such problems are hard to recognize from published records of the experiments in which they occur; rather, these problems are often uncovered by reports of independent skilled observers who were present during the experiment. This suggests that magicians and psychologists be regularly used as observers. New statistical ideas have been developed for some of the new experiments. For example, many modern ESP studies provide subjects with feedback—partial information about previous guesses—to reward the subjects for correct guesses in hope of inducing ESP learning. Some feedback experiments can be analyzed with the use of skill-scoring, a statistical procedure that depends on the information available and the way the guessing subject uses this information. (p. 131)

Is modern parapsychological research worthy of serious consideration? The volume of literature by reputable scientists, the persistent interest of students, and the government’s funding of ESP projects make it difficult to evade this question. Over the past 10 years, in the capacity of statistician and professional magician, I have had personal contact with more than a dozen paranormal experiments. My background encourages a thorough skepticism, but I also find it useful to recall that skeptics make mistakes. …

Critics of ESP must acknowledge the possibility of missing a real phenomenon because of the difficulty of designing a suitable experiment. However, the characteristics which lead many to be dubious about claims for ESP—its sporadic appearance, its need for a friendly environment, and its common association with fraud—require of the most sympathetic analyst not only skill in the analysis of nonstandard types of experimental design but appreciation of the differences between a sympathetic environment with flexible study design and experimentation which is simply careless or so structured as to be impossible to evaluate.

In this article I use examples to indicate the problems associated with the generally informal methods of design and evaluation of ESP experiments—in particular, the problems of multiple end points and subject cheating. I then review some of the commentaries of outstanding statisticians on the problems of evaluation. Finally, as an instance of using new analytic methods for non-standard experiments, I give examples of some new statistical techniques that permit appropriate evaluation of studies that allow instant feedback of information to the subject after each trial, an entirely legitimate device used to facilitate whatever learning process may be involved. (p. 131)

Statisticians and ESP (p. 133)

The only widely respected evidence for paranormal phenomena is statistical. Classical statistical tests are reported in each of the published studies described above. Most often these tests are ‘highly statistically significant.’ This only implies that the results are improbable under simple chance models. In complex, badly controlled experiments simple chance models cannot be seriously considered as tenable explanations; hence, rejection of such models is not of particular interest. For example, the high significance claimed for the famous Zenith Radio experiment is largely a statistical artifact (18). Listeners were invited to mail in their guesses on a random sequence of playing cards. The proportion of correct guesses was highly significant when calculations were based on the assumption of random guessing on the part of each listener. It is well known (19) that the distribution of sequences produced by human subjects is far from random, and hence the crucial hypothesis of independence fails in this situation. More sophisticated analysis of the Zenith results gives no cause for surprise.

In well-run experiments, statistics can aid in the design and final analysis. The idea of deliberately introducing external, well-controlled randomization in investigation of paranormal phenomena seems due to Richet (20) and Edgeworth (21). Later, Wilks (22) wrote a survey article on reasonable statistical procedures for analyzing paranormal experiments popular at the time. Fisher developed new statistical methods that allow credit for ‘close’ guesses in card-guessing experiments (23). Good (24) continues to suggest new experiments and explanations for ESP. The parascience community, well aware of the importance of statistical tools, has solved numerous statistical riddles in its own literature. Any of the three best known parascience journals is a source of a number of good surveys and discussions of inferential problems (25).

For the full article and citations: Statistical Problems in ESP Research
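
To make the Zenith point in the excerpt concrete, here is a minimal Python sketch (an illustration only, not taken from Diaconis’s article). It assumes a purely hypothetical preference pattern: listeners guessing a five-symbol binary sequence favor sequences that alternate and that start with a 1. The weights, the sequence length, and all function names are illustrative assumptions, not the actual Zenith data. The sketch computes the z-score that the simple chance model would assign to the expected hit count, with no ESP anywhere in the model.

```python
import itertools
import math


def naive_z(hits, trials, p=0.5):
    """z-score of a hit count under the simple chance model
    (every guess an independent fair coin flip)."""
    return (hits - trials * p) / math.sqrt(trials * p * (1 - p))


def guess_distribution(length=5, first_symbol_bonus=2.0):
    """Hypothetical listener preferences over binary sequences.

    Assumption for illustration: a sequence is more popular the more
    it alternates, and more popular still if it starts with 1.
    """
    seqs = list(itertools.product((0, 1), repeat=length))
    weights = []
    for s in seqs:
        alternations = sum(a != b for a, b in zip(s, s[1:]))
        bonus = first_symbol_bonus if s[0] == 1 else 1.0
        weights.append((1 + alternations) * bonus)
    total = sum(weights)
    return seqs, [w / total for w in weights]


def expected_z(target, n_listeners, uniform=False):
    """Naive z-score of the EXPECTED hit count for a given target sequence."""
    seqs, probs = guess_distribution(len(target))
    if uniform:
        # Truly random guessing: every sequence equally likely.
        probs = [1 / len(seqs)] * len(seqs)
    hits_per_listener = sum(
        p * sum(g == t for g, t in zip(s, target)) for s, p in zip(seqs, probs)
    )
    trials = n_listeners * len(target)
    return naive_z(hits_per_listener * n_listeners, trials)


popular = (1, 0, 1, 0, 1)    # matches the assumed preferences
unpopular = (0, 1, 0, 1, 0)  # the mirror image
z_popular = expected_z(popular, 10_000)
z_unpopular = expected_z(unpopular, 10_000)
z_uniform = expected_z(popular, 10_000, uniform=True)
print(f"target matches preferences: z = {z_popular:.1f}")
print(f"target opposes preferences: z = {z_unpopular:.1f}")
print(f"truly random guessing:      z = {z_uniform:.1f}")
```

Under these assumptions, whether the result looks wildly "significant" in one direction or the other depends only on how the target happens to line up with what people like to guess. Rejecting the simple chance model tells us that model is wrong, which we already knew.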

The grounds for the growing skepticism of the period were based on the obstacles standing in the way of valid testing of the variety of different ESP hypotheses. Examples include: multiple end points, subject cheating, unconscious cueing, gaps between published records and actual experimental protocols, poorly designed, badly run, and inappropriately analyzed experiments. “Even if there had not been subject cheating, the experiments described above would be useless because they were out of control. The confusing and erratic experimental conditions I have described are typical of every test of paranormal phenomena I have witnessed”. (Diaconis, p. 133)

My takeaway message: The background knowledge here, insofar as it is relevant for inquiry, consists of very specific problems as well as specific recommendations/requirements for experimental designs. Communicating and using the background information in inquiry also involves describing specific protocols, checks, and stipulations for any future experimental demonstrations to pass muster.

You may be interested to read some critical “letters” by Tart, and Puthoff and Targ, with an author response.

*There’s more: it was part of a ‘popular culture society’ meeting!


13 thoughts on “Statistics and ESP research (Diaconis)”

  1. E. Berk

    I believe the Parapsychological Association is still a member of the AAAS. If so, why?

    • Berk: Sailor took the opportunity to check, and it does seem the AAAS includes parapsychology. Thanks.

  2. Some people have written expressing their radical rejection of ESP, and wondering why I bring it up. The answer is that I keep being asked a question that is regarded as highly critical for error statistical tests, that refers to tests on ESP. It goes like this:

    If we are prepared to take a statistically significant proportion of successes (greater than .5) in n Binomial trials as grounds for inferring a real (better than chance) effect (perhaps of two teaching methods) but not as grounds for inferring Uri’s ESP then aren’t we implicitly invoking a difference in prior probabilities? The answer is No.
    For one thing, merely finding evidence of a non-chance effect is at a different “level” from a subsequent question about the explanation or cause of a non-chance effect. The severity analysis for the respective claims makes this explicit. But my real point is that the background information that is relevant for inquiry into the phenomena is given in terms of the series of problems, flaws and fallacies, as described in this post.
    If given the choice to have the background summed up in terms of this series of problems and protocols (say in 1980), or only in a degree of belief in the reality of ESP, which would be regarded as a more adequate sum-up?
    Which would communicate the relevant information for conducting and analyzing research? (and of course I would make the analogous argument for inquiry more generally). I think it is obvious.
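
For readers who want the first-level calculation spelled out, here is a minimal Python sketch of the tail probability involved. The numbers (65 successes in 100 trials) and the function name are hypothetical, chosen only for illustration. The arithmetic is the same whether the trials come from a comparison of teaching methods or from a card-guessing session, which is why it cannot by itself speak to the explanation of a non-chance effect.

```python
import math


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p), computed exactly."""
    return sum(
        math.comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: 65 successes in 100 trials.
p_value = binomial_upper_tail(65, 100)
print(f"upper-tail probability under chance: {p_value:.5f}")
```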

  3. Corey

    Mayo wrote: “If given the choice to have the background summed up in terms of this series of problems and protocols (say in 1980), or only in a degree of belief in the reality of ESP, which would be regarded as a more adequate sum-up?”

    This is a false dichotomy — nothing in the Bayesian approach requires the background to be summed up *only* in a degree of belief in the reality of ESP. The Bayesian (Jaynesian) approach requires plausibility assessments for all hypotheses* that could explain the data, including experimental flaws.

    * hypotheses that imply the same data distribution get lumped together, see sections 8.10.1 and 8.11 of E.T. Jaynes’s PTLOS.

    • Corey: Thanks for this. Since I noticed this earlier, but have not had a chance to respond til now, I’m going to get something out quickly to you (so I hope it’s readable).
      First off, no one who raises the criticism alleges that I would not be able to take into account the flaws and fallacies of existing studies in either the design or interpretation of the results they describe. They say I am missing, or am in need of, a prior probability in ESP.

      But let’s get to your second point about introducing priors for each of the different flaws or hypothesized explanations. I take it they would be based on how often, how typical, or perhaps how easy certain flaws are? and/or would the specific experiment under analysis be the basis for the priors, taking into account the type of experiment (e.g., cards, remote viewing, etc.) and perhaps the subject being tested (if there is background on him or her)?

      Once collected, would one do a formal Bayesian computation of the probability the data x arose through ESP as opposed to these other things? Why not just use the background of known flaws and tricks, etc. to solve the Duhemian problem of explaining the results x in the case at hand (or conclude one cannot explain them for now)? Granted, one might separately wish to arrive at a probability of the occurrence of each experimental flaw (e.g., 60% of remote viewing studies in the 70s use cueing), but to explain a given set of experimental PSI data, that would seem rather circuitous at best, wouldn’t it? OK, more in a later post, hopefully.

      • Corey

        Mayo wrote: I take it they would be based on how often, how typical, or perhaps how easy certain flaws are?

        They would be based on how plausible the flaws are. All of the kinds of information you mention could inform the prior distribution. I wish I knew of a formal principle for encoding that kind of prior knowledge into a prior distribution. It seems the best we can do at this stage is to lay out the prior knowledge, choose one distribution that seems to encode it (better yet, more than one, and perform a sensitivity analysis), and proceed.

        One wrinkle to note is that when people are performing inference about each others’ actions, incentives are important, and game theory can’t be avoided.

        Mayo wrote: Once collected, would one do a formal Bayesian computation of the probability the data x arose through ESP as opposed to these other things? Why not just use the background of known flaws and tricks, etc. to solve the Duhemian problem of explaining the results x in the case at hand (or conclude one cannot explain them for now).

        I view the “use the background of known flaws and tricks to [and so forth]”, however construed, as an approximation to a formal Bayesian computation — probably quite a good one, too.

        • Corey: Yes, if all flesh is grass then kings and queens (being made of flesh) are also grass. Or however the saying goes. As I’m rerouted to an unchartered airport island, I can’t check it, but you get the point I hope.

  4. Eileen

    This is interesting. Bayesians claim an advantage to quantifying prior information as formal probabilities, yet when I look at all the (very) specific background information that you list (all the problems with “multiple end points, subject cheating”, etc.), none of this would be transmitted by announcing “I do not believe in ESP” or “I give it a (very) low prior”. It strikes me that a lot of information is being lost (or buried) in the prior. Corey’s Jaynesian approach above sounds like they could get lumped (buried) together into a big experimental flaw probability/plausibility assessment? (I am not familiar with this Jaynesian approach.) Still, I think I would be more concerned about being able to test whether or not a specific error occurred, not whether someone thinks it (with probability x) is a plausible explanation for the data.

    • Thanks Eileen. I agree and your comment is in sync with the one I just posted; I guess you must have been a student of mine at some point (or maybe in a past life)!

  5. guest

    If you haven’t seen it already, the work of Jessica Utts may be of interest.

    Bayesian analysis of this problem need not involve working out the probability of the data under all competing individual explanations for the data. Instead, we can put a prior on e.g.:

    * the size of the true effect of ESP
    * the size of the difference between the true effect of ESP and the parameter learned about, with the flawed study design we have.

    (There are a lot of ways to parameterize this problem, this is just one.) The prior on the “difference” term reflects just how bad You believe the study design to be. Neither aspect of the prior need have a spike at zero.

    The posterior would involve both quantities. It’s quite easy for the posterior on the true effect to not get updated much, if at all. If the prior doesn’t reflect some design flaw that actually matters – because no-one thought of it – it’s possible for the Bayesian analysis to be misleading. But so would the analysis of any non-Bayesian who’d missed the same design flaw.
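
One way to see how the posterior on the true effect can fail to update is a minimal normal-normal sketch in Python. Everything here is an assumption made for illustration: the observed estimate y is modeled as Normal(theta + delta, se^2), with independent zero-centered normal priors on the true effect theta and on the design-induced difference delta. The commenter’s parameterization is not spelled out in this much detail, so the function name and the numbers are hypothetical.

```python
import math


def posterior_true_effect(y, se, prior_sd_effect, prior_sd_bias):
    """Posterior mean and sd of the true effect theta, assuming
    y ~ Normal(theta + delta, se^2),
    theta ~ Normal(0, prior_sd_effect^2),
    delta ~ Normal(0, prior_sd_bias^2), priors independent."""
    effect_var = prior_sd_effect**2
    # Integrating out delta just adds its prior variance to the noise.
    noise_var = prior_sd_bias**2 + se**2
    shrink = effect_var / (effect_var + noise_var)
    post_mean = shrink * y
    post_var = effect_var * noise_var / (effect_var + noise_var)
    return post_mean, math.sqrt(post_var)


y, se, prior_sd_effect = 0.10, 0.01, 0.05  # hypothetical numbers

# Design believed to be nearly flawless: tight prior on the difference.
mean_tight, sd_tight = posterior_true_effect(y, se, prior_sd_effect, 0.001)
# No idea how bad the design is: very diffuse prior on the difference.
mean_loose, sd_loose = posterior_true_effect(y, se, prior_sd_effect, 10.0)

print(f"trusted design:   mean = {mean_tight:.4f}, sd = {sd_tight:.4f}")
print(f"untrusted design: mean = {mean_loose:.4f}, sd = {sd_loose:.4f}")
```

With a diffuse prior on the difference term, the posterior for the true effect is essentially the prior again, however striking the observed estimate. The conclusion is only as good as the prior on the difference, which is where a flaw no one thought of would have to show up.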

    • Thanks Guest: I think this gets precisely to my reason for saying, not only do we not need/want to assign priors to all design flaws, ESP effects, etc. to analyze a given study, but that such an attempt falls short of capturing what the (hard won) background information really consists of. And whose priors for the various flaws that might invalidate THIS experiment do you use: Charles Tart, Diaconis? You? Anyway, it seems to me that having to assign priors to the ways THIS experiment can be wrong (even as a catchall) is to foist a method on a situation where a far more direct analysis exists: did he switch the film in getting these pictures? Let’s see what happens when he’s not allowed near it, etc. Moreover, you’d lose the force of the knock down evidence of why this result provides no evidence for ESP. They’d likely still be at it.

      • guest

        “Assigning priors to design flaws” isn’t really what’s being done. Instead, the prior is assigned to the difference in e.g. average predictive accuracy for cards in the experiment at hand (say where cards were not shuffled between draws) and average predictive accuracy for cards in a perfect experiment – that reflects effects due to ESP alone. (“average” here indicates averaging over the people in the study, doing the predicting). If we know enough about the experiment that was done, a quite tight prior on the difference could be set. If we have no idea how crappy the experiment is, any prior reflecting this will lead to a posterior where we learn nothing about the ESP effect.

        Also, I don’t think one does lose the “knock down”. One can still get a posterior indicating that we’ve learned nothing about ESP effect.

        If you want to get away from ESP, similar arguments are applied in measurement error problems, using priors on how crappy our data is at telling us about effects we actually want to know about – there’s a huge literature on it, Bayesian and otherwise.

    • Guest: By the way, thanks for the reference to Jessica Utts. It will be interesting to see what’s being done on stat and psi these days, when I can manage it.
