knitr::opts_chunk$set(echo = TRUE)
library(wesanderson) # color palettes
library(dplyr) # tidyverse to make our lives better
library(fMultivar) # rnorm2d() for correlated bivariate normals
wp <- wes_palette("Darjeeling1") # specific color palette

Correlations are the simplest and most common statistical method for detecting a relationship between two variables. Because they are so easy to calculate and lend a gloss of sophistication in places where analytics are less common, they are easy to misuse or misunderstand. In today’s post I’ll highlight a few different ways that correlations can mislead us if we blindly generate them without first plotting our data. Regular readers will also note this is a sort of continuation of the Simpson’s Paradox posts.

## The Basics, Quickly

A correlation is simply a statistical relationship between two random variables. However, the most common correlation measure, Pearson’s correlation coefficient, assumes that the variables have a linear relationship (read: no curves), as in the following:

set.seed(42)
# par(mfrow = c(1,3))
for (i in c(.3, .6, .9)){
  xy <- rnorm2d(1000, rho = i) %>% as_tibble() %>% rename(x = V1, y = V2)
  plot(xy$x, xy$y, xlab = "x", ylab = "y", col = wp, pch = 19)
  abline(lm(y ~ x, data = xy), col = wp, lwd = 3)
}

This is well known, and it’s perfectly fine when you are operating in a linear or near-linear world. But it’s easy to forget this not-so-little detail, and such forgetting can be consequential.
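
To make the linearity point concrete, here is a minimal sketch of Pearson’s coefficient computed by hand from centered values. The helper name `pearson_r` and the simulated data are my own illustration, not part of any package; the only claim is that it should agree with base R’s `cor()`.

```r
# Pearson's r from first principles: the sum of centered cross-products,
# scaled by the square root of the product of the centered sums of squares.
# pearson_r is a hypothetical helper written for this illustration.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# Illustrative data only: a noisy straight-line relationship.
set.seed(1)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.8)

# Should report TRUE: the hand-rolled value matches cor().
all.equal(pearson_r(x, y), cor(x, y))
```

The numerator only grows when points sit above average on both variables or below average on both, so the coefficient rewards straight-line patterns. A curve that rises and then falls can cancel itself out.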

The examples below vividly remind us to always plot our data and not be blinded by the love of a number.

## Curves: Correlation Bad

### Example 1

This is a simple sine wave: a clear but decidedly non-linear relationship. The correlation here is 0, but the relationship is obvious.
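
Why does the number land on zero? Here is a rough check, assuming we treat x as continuous and evenly spread over the window. The covariance is the mean of x·sin(x) minus the product of the two separate means, and over a window of 0 to 3π those two terms happen to cancel. The helper `cov_sine` is hypothetical, written only for this check.

```r
# Covariance of x and sin(x) when x is spread evenly over [0, upper].
# cov_sine is a made-up helper for illustration, not a library function.
cov_sine <- function(upper) {
  mean_x  <- upper / 2
  mean_y  <- integrate(sin, 0, upper)$value / upper
  mean_xy <- integrate(function(x) x * sin(x), 0, upper)$value / upper
  mean_xy - mean_x * mean_y
}

cov_sine(3 * pi)  # the window used in this example
cov_sine(2 * pi)  # a different window, for comparison
```

On the 3π window the covariance should come out at essentially zero, while on a 2π window it should be clearly negative. So the zero says as much about the window we happened to sample as about the curve itself.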

set.seed(2112)
x <- seq(0, 3*pi, .01)
y <- sin(x)
cor_xy <- cor(x, y) %>% round(2)
plot(x, y, type = "l", lwd = 3, col = wp)
abline(lm(y ~ x), col = wp, lwd = 3)
text(x = 1.5 * pi, y = .4, labels = paste("Correlation =", cor_xy))

### Example 2

Here, what goes up must come down…but if we look only at the correlation, it’s like nothing happened at all.

x <- seq(0, 10, .1)
y <- ifelse(x <= 5, x, 10-x)
cor_xy <- cor(x, y) %>% round(2)
plot(x, y, type = "l", lwd = 3, col = wp)
abline(lm(y ~ x), col = wp, lwd = 3)
text(x = 5, y = 3, labels = paste("Correlation =", cor_xy))

### Example 3

This is one that I borrowed from a piece by Nassim Nicholas Taleb. He presents a more thorough (if varied and ad hominem-laden) treatment of the misuse of correlation. I could do without the tone and attacks in that piece, but I thought this core demonstration was particularly relevant to human capital analytics.

In words, he shows that a non-linear relationship in the presence of noise can fool us into thinking we have a strong linear relationship.

For example, imagine you have a post-hire training assessment scored from 0 to 100 and a first-month performance measure between 0 and 150 (the units are arbitrary here). Let’s further assume that the true underlying relationship between the assessment and first-month performance is non-linear: performance increases as assessment scores rise from 0 to 50 and then flattens completely for those scoring above 50 (see figure below).

x <- 0:100
y <- ifelse(x <= 50, x, 50) + 30
cor_xy <- cor(x,y) %>% round(2)
plot(x, y, col = wp, ylim = c(0, 150),
xlab = "Test", ylab = "Performance", pch = 19)
abline(lm(y ~ x), col = wp, lwd = 3)
text(50, 5, labels = paste("Correlation = ", cor_xy))

We see a super strong correlation in this noise-free case, but even here it’s deceptive. What we really have is a perfect correlation for the first half and then no relationship for those in the second half. Zero.
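
A quick way to check the "perfect, then nothing" claim is to split the noise-free data at 50 and look at each half on its own. This is only a sketch; the names `lower`, `cor_lower`, and `sd_upper` are mine. I look at the spread of the upper half instead of its correlation because `cor()` returns `NA` (with a warning) when one variable is constant.

```r
# Same noise-free data as above.
x <- 0:100
y <- ifelse(x <= 50, x, 50) + 30

lower <- x <= 50

# Rising segment: y is an exact linear function of x.
cor_lower <- cor(x[lower], y[lower])

# Flat segment: y never changes, so there is nothing for x to explain.
sd_upper <- sd(y[!lower])

cor_lower  # should be 1
sd_upper   # should be 0
```

The single overall coefficient blends these two very different regimes into one number that describes neither.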

Now we’ll progressively add noise (with a mean of zero) to the performance scores and then watch how the correlation decreases gradually but hides the underlying non-linear relationship.

Eventually we see a result that fools us into thinking we have a great linear relationship.

for (i in seq(0, 10, 2)) {
  set.seed(42)
  x <- rep(1:100,10)
  y <- ifelse(x <= 50, x, 50) + rnorm(length(x), mean = 0, sd = 2*i) + 30
  cor_xy <- cor(x,y) %>% round(2)
  plot(x, y, col = wp, ylim = c(0, 150),
       xlab = "Test", ylab = "Performance", pch = 19)
  abline(lm(y ~ x), col = wp, lwd = 3)
  text(50, 5, labels = paste("Correlation = ", cor_xy, "\nNoise sd= ", 2*i))
}

In the analytics and predictive modeling sphere, we would take a .57 correlation every day. The results here should give us pause about using linear methods in the presence of possible non-linear relationships.
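
One hedged way to catch this in practice is to look past the single coefficient: compute the correlation within each half of the score range and overlay a smoother on the scatterplot. The sketch below regenerates the noisiest case (sd = 20) with the same seed. The split at 50 is known here only because we built the data; with real data you would have to find it by plotting. The names `r_all`, `r_low`, and `r_high` are my own.

```r
# Rebuild the noisiest scenario from the loop above (noise sd = 20).
set.seed(42)
x <- rep(1:100, 10)
y <- ifelse(x <= 50, x, 50) + rnorm(length(x), mean = 0, sd = 20) + 30

# One overall number versus one number per regime.
r_all  <- cor(x, y)
r_low  <- cor(x[x <= 50], y[x <= 50])
r_high <- cor(x[x > 50],  y[x > 50])
round(c(all = r_all, low = r_low, high = r_high), 2)

# Straight-line fit (solid) versus a lowess smoother (dashed).
plot(x, y, col = "grey60", pch = 19, ylim = c(0, 150),
     xlab = "Test", ylab = "Performance")
abline(lm(y ~ x), lwd = 3)
lines(lowess(x, y), lwd = 3, lty = 2)
```

If the relationship really were linear, the two halves should show similar correlations and the dashed curve should hug the solid line. Here the upper half should sit near zero and the smoother should bend around 50, which is the plateau the overall coefficient hides.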

In a future post, I will say more about the issues with, and alternatives to, linear methods like correlation. For now, I encourage you to take some time for independent research and see what you learn... and please, always plot your data.

Until then…