These days, there are so many dubious assertions about alleged correlations between two variables that an entire website, Spurious Correlation (Tyler Vigen), is devoted to exposing (and creating*) them! A classic problem is that the means of variables X and Y may both be trending in the order the data are observed, invalidating the assumption that their means are constant. In my initial study with Aris Spanos on misspecification testing, the X and Y means were trending in much the same way I imagine a lot of the examples on this site are, like the one on the number of people who die by becoming tangled in their bedsheets and the per capita consumption of cheese in the U.S.
The annual data for 2000-2009 are: xt, per capita consumption of cheese (U.S.): x = (29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8); yt, number of people who died by becoming tangled in their bedsheets: y = (327, 456, 509, 497, 596, 573, 661, 741, 809, 717).
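To get a rough feel for how much of the association rides on the shared trend, here is a minimal Python sketch using the ten observations above. It computes the ordinary sample correlation between x and y, then the correlation between the residuals left after fitting a straight-line trend in t to each series. The linear trend, and the helper names `pearson` and `detrend`, are illustrative assumptions of this sketch, not necessarily the specification used in Spanos's note.

```python
from math import sqrt

# Annual data for 2000-2009, copied from the text above.
x = [29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8]  # cheese
y = [327, 456, 509, 497, 596, 573, 661, 741, 809, 717]            # bedsheets
t = list(range(len(x)))  # time index 0..9 (assumed ordering variable)


def mean(v):
    return sum(v) / len(v)


def pearson(a, b):
    """Sample correlation coefficient between two equal-length series."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def detrend(v, t):
    """Residuals from a least-squares straight line v = a + b*t.

    A linear trend is an assumption made for illustration only.
    """
    mt, mv = mean(t), mean(v)
    slope = (sum((ti - mt) * (vi - mv) for ti, vi in zip(t, v))
             / sum((ti - mt) ** 2 for ti in t))
    intercept = mv - slope * mt
    return [vi - (intercept + slope * ti) for ti, vi in zip(t, v)]


r_raw = pearson(x, y)
r_detrended = pearson(detrend(x, t), detrend(y, t))

print(f"correlation of raw series:       {r_raw:.3f}")
print(f"correlation of detrended series: {r_detrended:.3f}")
```

If the association mostly reflects the two means drifting upward together, the second number should be noticeably smaller than the first. With only ten observations, neither figure should be read as more than suggestive.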
I asked Aris Spanos to have a look, and it took him no time to identify the main problem. He was good enough to write up a short note which I’ve pasted as slides.
Wilson E. Schmidt Professor of Economics
Department of Economics, Virginia Tech
*The site says that the server attempts to generate a new correlation every 60 seconds.
Having done a lot of applied work, I think both Spanos and the jokey correlation website miss the boat on a couple of things here.
Many of the situations being called spurious are not “spurious” at all, but non-random correlations due to a common cause, just not the cause in question (usually a direct effect of one correlate on the other).
The “de-trend the mean” suggestion further muddies the waters. What’s the distinction between variation that needs to be “de-trended” and variation that doesn’t? It’s fundamentally a causal distinction.