How to Avoid Aggregation Errors and Simpson’s Paradox in HR Analytics: Part 1
Recently I’ve been reading Scott Page’s masterful book “The Model Thinker”.
It’s full of powerful insights, so in this and future posts I will take things I’ve learned there and show how they apply in HR Analytics.
Today, it’s Simpson’s Paradox and the dangers lurking in unexamined aggregations.
Simpson’s Paradox
Simpson’s Paradox occurs when a trend appears in two or more groups analyzed separately but then disappears or reverses when the groups are combined.
In plain English, “Combine the data and you get one trend, split the data and get another.”
Let’s look at an example.
### Use this R code to create the same
### simulated data used here
set.seed(42)
area <- rep(c("retail", "commercial"), each = 100)
exp_r <- runif(n = 100, 1, 5)
exp_c <- runif(n = 100, 4, 8)
p <- data.frame(area = area, exp = c(exp_r, exp_c))
p$penalty <- ifelse(p$area == "retail", -1, 1)
p$error <- rnorm(200, 3000, 800)
p$sales <- 800*p$exp - 3000*p$penalty + p$error
Experience = Bad?
Suppose I’ve just been appointed Director of Sales for an international fasteners company.
I think experience matters when it comes to sales success, so I want to use it as part of my hiring criteria. But I have some data too, so I decide to look at the numbers first.
I start with a simple correlation to assess the relationship between years of previous sales experience and first-year sales for new hires at our company. The result is… -0.38!?!
cor(p$exp, p$sales)
So more previous sales experience is associated with fewer sales in the first year? Really?
I’m skeptical. Let’s turn to a boxplot instead.
I do a median split on experience and compare the first-year sales numbers for those below and above the median years of previous sales experience.
library(ggplot2)      # plotting
library(wesanderson)  # wes_palette() color palettes
rush_gr <- wes_palette(n = 5, name = "Rushmore1")[3]
ggplot(p, aes(x = exp > median(exp), y = sales)) + geom_boxplot(fill = rush_gr)
Same thing! Those with experience above the median actually have lower first-year sales than those below the median.
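To put numbers on the boxplot, here is a minimal sketch that compares median first-year sales on either side of the median split. It rebuilds the simulated data from the top of the post so it runs on its own; `above_median` and `median_sales` are names introduced here purely for illustration.

```r
# Rebuild the simulated data from the top of the post so this runs standalone
set.seed(42)
area <- rep(c("retail", "commercial"), each = 100)
exp_r <- runif(n = 100, 1, 5)
exp_c <- runif(n = 100, 4, 8)
p <- data.frame(area = area, exp = c(exp_r, exp_c))
p$penalty <- ifelse(p$area == "retail", -1, 1)
p$error <- rnorm(200, 3000, 800)
p$sales <- 800 * p$exp - 3000 * p$penalty + p$error

# TRUE = above-median experience, FALSE = at or below the median
above_median <- p$exp > median(p$exp)

# Median first-year sales within each half of the split
median_sales <- tapply(p$sales, above_median, median)
print(round(median_sales))
```

The `TRUE` entry (more experience) should come out lower than the `FALSE` entry, matching what the boxplot shows.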
If we stopped there, we might conclude we should stop hiring based on previous sales experience or maybe even focus on recruiting people with LESS sales experience.
Because that’s what the data say and we just need to follow the data…right?
No.
Plot Your Data
Remember the first rule of analytics: Plot your data.
Lo and behold, when we plot our data, we get some clarity on our initial results.
ggplot(p, aes(x = exp, y = sales)) +
geom_point() +
ggtitle("Sales by Years of Experience") +
xlab("Years of Experience") +
ylab("Sales")
Plotting makes it clear that we have two distinct groups in our data.
For the first group on the upper left, we see a clear upward slope, with increasing sales experience associated with increasing first-year sales.
But the same thing holds true for our second group on the lower right…sales experience is also positively related to sales.
In sum, considering each group separately, we see a clear POSITIVE relationship between previous experience and first-year sales. But lump it all together and we get a NEGATIVE relationship.
This is the essence of Simpson’s Paradox: Combine your data and you get one thing, split your data and get another.
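The same reversal shows up in the correlations themselves. The sketch below, which again rebuilds the simulated data so it is self-contained, computes the correlation once on the combined data and once within each area. The names `overall` and `by_area` are illustrative and not part of the original code.

```r
# Rebuild the simulated data from the top of the post so this runs standalone
set.seed(42)
area <- rep(c("retail", "commercial"), each = 100)
exp_r <- runif(n = 100, 1, 5)
exp_c <- runif(n = 100, 4, 8)
p <- data.frame(area = area, exp = c(exp_r, exp_c))
p$penalty <- ifelse(p$area == "retail", -1, 1)
p$error <- rnorm(200, 3000, 800)
p$sales <- 800 * p$exp - 3000 * p$penalty + p$error

# Correlation on the combined data (the number reported earlier)
overall <- cor(p$exp, p$sales)

# Correlation computed separately within each area
by_area <- sapply(split(p, p$area), function(d) cor(d$exp, d$sales))

print(round(overall, 2))
print(round(by_area, 2))
```

Combined, the correlation is negative; within retail and within commercial, it is positive.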
All Is Revealed
Let’s dive a bit deeper and use some regressions to see what is happening.
First, the combined data with a single regression line.
ggplot(p, aes(x = exp, y = sales)) +
geom_point() +
geom_smooth(method = "lm", color = "black", se = FALSE) +
ggtitle("Sales by Years of Experience") +
xlab("Years of Experience") +
ylab("Sales")
The regression line has a negative slope, consistent with our earlier negative correlation.
But looking at the plot, we see a confound: the people in the generally more experienced group also have lower first-year sales.
Our simple regression line has a negative slope because it’s being swamped by this group effect. Those on the lower right have both more experience and lower sales, which drags the fitted line downward and misrepresents the otherwise visually obvious, positive relationship between experience and sales.
This is not the regression’s fault. It’s our fault because looking at people based ONLY on experience masks what is really happening in the data.
We get a totally different outcome analyzing these two groups separately.
temp_color <- wes_palette(n = 5, name = "Rushmore1")
ggplot(p, aes(x = exp, y = sales)) +
geom_point(aes(color = area)) +
# dashed line: a single fit to the combined data
geom_smooth(method = "lm", color = temp_color[4], linetype = 2) +
# solid lines: one fit per area
geom_smooth(method = "lm", aes(color = area)) +
scale_color_manual(values = temp_color[c(3, 5)]) +
ggtitle("Sales by Years of Experience") +
xlab("Years of Experience") +
ylab("Sales")
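The plot can be backed up with the fitted coefficients. As a rough sketch, the code below fits one model on experience alone and a second that also includes `area`, then compares the experience slope in each. It rebuilds the simulated data so it runs standalone. The model names `pooled` and `adjusted` are introduced here for illustration, and adding `area` as a simple covariate assumes both groups share the same slope, which holds in this simulation but may not hold in real data.

```r
# Rebuild the simulated data from the top of the post so this runs standalone
set.seed(42)
area <- rep(c("retail", "commercial"), each = 100)
exp_r <- runif(n = 100, 1, 5)
exp_c <- runif(n = 100, 4, 8)
p <- data.frame(area = area, exp = c(exp_r, exp_c))
p$penalty <- ifelse(p$area == "retail", -1, 1)
p$error <- rnorm(200, 3000, 800)
p$sales <- 800 * p$exp - 3000 * p$penalty + p$error

# Experience only: the group effect leaks into the slope
pooled <- lm(sales ~ exp, data = p)

# Experience plus area: the slope is estimated within groups
adjusted <- lm(sales ~ exp + area, data = p)

print(round(coef(pooled)[["exp"]]))
print(round(coef(adjusted)[["exp"]]))
```

The experience slope is negative in the pooled model and positive once area is accounted for, which is the same story the dashed and solid lines tell in the plot.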