How Correlations Can Fool You: The Hidden Dangers of Non-Linearity
In today’s post I’ll highlight how correlations can fool you if you blindly generate them without first plotting the data, particularly when you have non-linearities (read: curvy data). Regular readers will also note this is a sort of continuation of the Simpson’s Paradox posts here and here.
Because correlations are so easy to calculate and lend a gloss of sophistication in places where analytics are less common, it can be easy to misuse or misunderstand them. This is especially true for correlations in HR data.
R Stuff Preliminaries
If you want to play along at home (highly recommended) load these libraries before getting started.
library(wesanderson) #color palettes
library(dplyr) #tidyverse to make our lives better
library(fMultivar) #rnorm2d() for simulating correlated bivariate normal data
wp <- wes_palette("Darjeeling1") #specific color palettes
The Basics, Quickly
A correlation is simply a statistical relationship between two random variables.
However, the most common correlation measure, Pearson’s correlation coefficient, assumes that the variables have a linear relationship (read: no curves), as in the following:
set.seed(42)
# par(mfrow = c(1,3))
for (i in c(.3, .6, .9)) {
  xy <- rnorm2d(1000, rho = i) %>%
    as_tibble() %>%
    rename(x = V1, y = V2)
  plot(xy$x, xy$y, xlab = "x", ylab = "y", col = wp[5], pch = 19)
  abline(lm(y ~ x, data = xy), col = wp[1], lwd = 3)
}
This is well known and it’s perfectly fine when you are operating in a linear or near-linear world. But it’s easy to forget this not-so-little detail and such forgetting can be consequential.
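One way to see why linearity matters: Pearson’s r is the covariance of the two variables scaled by their standard deviations, and its square equals the R-squared of a straight-line fit. The sketch below checks both identities on simulated data. The variable names and simulation settings are illustrative assumptions, not part of the examples in this post.

```r
# Illustrative data: a noisy straight-line relationship (settings are arbitrary)
set.seed(1)
x <- rnorm(500)
y <- 0.6 * x + rnorm(500, sd = 0.8)

# Pearson's r computed three ways
r_builtin <- cor(x, y)                       # cor() defaults to Pearson
r_by_hand <- cov(x, y) / (sd(x) * sd(y))     # covariance scaled by both sds
r_squared <- summary(lm(y ~ x))$r.squared    # R-squared of the straight-line fit

all.equal(r_builtin, r_by_hand)    # same quantity
all.equal(r_builtin^2, r_squared)  # r^2 is the share of variance a line explains
```

Since r only scores how well a straight line fits, it cannot tell you whether a curve would fit better.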
The examples below vividly remind us to always plot our data and not be blinded by the love of a number.
Curves: Correlation Bad
Example 1
This is a simple sine wave: a clear but decidedly non-linear relationship. The correlation here is 0, but the relationship is obvious.
set.seed(2112)
x <- seq(0, 3*pi, .01)
y <- sin(x)
cor_xy <- cor(x, y) %>% round(2)
plot(x, y, type = "l", lwd = 3, col = wp[5])
abline(lm(y ~ x), col = wp[1], lwd = 3)
text(x = 1.5 * pi, y = .4, labels = paste("Correlation =", cor_xy))
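A quick check, offered as a sketch rather than as part of the original example: the zero here depends on the window. Over 0 to 3π the positive and negative pieces cancel, but stopping after a single period (0 to 2π) gives a strongly negative correlation for the very same curve. The helper name `cor_sine` is made up for illustration.

```r
# Hypothetical helper: correlation between x and sin(x) on the window [0, upper]
cor_sine <- function(upper, step = 0.01) {
  x <- seq(0, upper, by = step)
  cor(x, sin(x))
}

r_3pi <- cor_sine(3 * pi)  # the window used in Example 1: close to zero
r_2pi <- cor_sine(2 * pi)  # one full period: strongly negative (about -0.78)

round(c(window_3pi = r_3pi, window_2pi = r_2pi), 2)
```

Same function, different slice of the x-axis, very different number. That is one more reason to look at the plot before trusting the coefficient.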
Example 2
Here, what goes up must come down…but if we look only at the correlation, it’s like nothing happened at all.
x <- seq(0, 10, .1)
y <- ifelse(x <= 5, x, 10-x)
cor_xy <- cor(x, y) %>% round(2)
plot(x, y, type = "l", lwd = 3, col = wp[5])
abline(lm(y ~ x), col = wp[1], lwd = 3)
text(x = 5, y = 3, labels = paste("Correlation =", cor_xy))
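Splitting the data at the peak makes the cancellation explicit. In this sketch (the split point of 5 comes from the definition of y above; the variable names are illustrative), each half is perfectly linear on its own, and the two halves offset each other in the overall coefficient.

```r
x <- seq(0, 10, .1)
y <- ifelse(x <= 5, x, 10 - x)

rising  <- x <= 5   # the "goes up" half
falling <- x > 5    # the "comes down" half

cor_rising  <- cor(x[rising],  y[rising])    # perfectly positive
cor_falling <- cor(x[falling], y[falling])   # perfectly negative
cor_overall <- cor(x, y)                     # the two halves cancel

round(c(rising = cor_rising, falling = cor_falling, overall = cor_overall), 2)
```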
Example 3
This is one that I borrowed from Nassim Nicholas Taleb here. He presents a more thorough (if varied and ad hominem-laden) treatment of the misuse of correlation. I could do without the tone and attacks in that piece but I thought this core demonstration was particularly relevant to human capital analytics.
In words, he shows that a non-linear relationship in the presence of noise can fool us into thinking we have a strong linear relationship.
For example, imagine you have a post-hire training assessment scored from 0 to 100 and a first-month performance measure between 0 and 150 (the units are arbitrary here). Let’s further assume that the true underlying relationship between the assessment and first-month performance is non-linear: performance increases as assessment scores rise from 0 to 50 and then flattens completely for those scoring above 50 (see figure below).
x <- 0:100
y <- ifelse(x <= 50, x, 50) + 30
cor_xy <- cor(x,y) %>% round(2)
plot(x, y, col = wp[5], ylim = c(0, 150),
     xlab = "Test", ylab = "Performance", pch = 19)
abline(lm(y ~ x), col = wp[1], lwd = 3)
text(50, 5, labels = paste("Correlation = ", cor_xy))
We see a super strong correlation in this noise-free case, but even here it’s deceptive. What we really have is a perfect correlation for the first half and then no relationship for those in the second half. Zero.
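To put numbers on that, here is a minimal sketch that looks at each half of the noise-free data separately. The cutoff of 50 comes from how y is defined above; the object names are illustrative. A slope is reported for the upper half because the correlation itself is undefined when y does not vary.

```r
x <- 0:100
y <- ifelse(x <= 50, x, 50) + 30

low  <- x <= 50
high <- x > 50

cor_low    <- cor(x[low], y[low])                # perfectly linear below the cutoff
slope_low  <- coef(lm(y[low] ~ x[low]))[[2]]     # one performance point per test point
slope_high <- coef(lm(y[high] ~ x[high]))[[2]]   # flat above the cutoff
sd_high    <- sd(y[high])                        # zero: no variation left to correlate

c(cor_low = cor_low, slope_low = slope_low,
  slope_high = slope_high, sd_high = sd_high)
```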
Now we’ll progressively add noise (with a mean of zero) to the performance scores and then watch how the correlation decreases gradually but hides the underlying non-linear relationship.
Eventually we see a result that fools us into thinking we have a great linear relationship.
for (i in seq(0, 10, 2)) {
  set.seed(42)
  x <- rep(1:100, 10)
  y <- ifelse(x <= 50, x, 50) + rnorm(length(x), mean = 0, sd = 2*i) + 30
  cor_xy <- cor(x, y) %>% round(2)
  plot(x, y, col = wp[5], ylim = c(0, 150),
       xlab = "Test", ylab = "Performance", pch = 19)
  abline(lm(y ~ x), col = wp[1], lwd = 3)
  text(50, 5, labels = paste("Correlation = ", cor_xy, "\nNoise sd= ", 2*i))
}
In the analytics and predictive modeling sphere, we would take a .57 correlation every day. The results here should give us pause about using linear methods in the presence of possible non-linear relationships.
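One simple way to catch this without relying on the eye alone is to average performance within bands of the test score. The sketch below is not from the original analysis. It reuses the noisiest setting from the loop (sd = 20), and the 10-point band width is an arbitrary assumption.

```r
set.seed(42)
x <- rep(1:100, 10)
y <- ifelse(x <= 50, x, 50) + rnorm(length(x), mean = 0, sd = 20) + 30

cor_noisy <- cor(x, y)   # a respectable-looking linear correlation

# Mean performance within 10-point bands of the test score
bands      <- cut(x, breaks = seq(0, 100, by = 10))
band_means <- tapply(y, bands, mean)

round(cor_noisy, 2)
round(band_means, 1)   # rises across the lower bands, then levels off near 80
```

If the band means climb and then stall, a single correlation coefficient is summarizing two different regimes at once.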
In a future post, I will say more about the issues with linear methods like correlation and the alternatives to them. For now, I encourage you to take some time for independent research and see what you learn. And please, always plot your data.
Until then…
Related Resources
New to R and RStudio? Check out my video RStudio for the Total Beginner
- © 2023 HR Analytics 101