How to Avoid Aggregation Errors and Simpson’s Paradox in HR Analytics: Part 1
Recently I’ve been reading Scott Page’s masterful book “The Model Thinker”.
It’s full of powerful insights, so in this and future posts I will take things I’ve learned there and show how they apply in HR Analytics.
Today, it’s Simpson’s Paradox and the dangers lurking in unexamined aggregations.
Simpson’s Paradox occurs when a trend appears in two or more groups analyzed separately but disappears or reverses when the groups are combined.
In plain English, “Combine the data and you get one trend, split the data and get another.”
Let’s look at an example.
    ### Use this R code to create the same
    ### simulated data used here
    set.seed(42)
    area <- rep(c("retail", "commercial"), each = 100)
    exp_r <- runif(n = 100, 1, 5)
    exp_c <- runif(n = 100, 4, 8)
    p <- data.frame(area = area, exp = c(exp_r, exp_c))
    p$penalty <- ifelse(p$area == "retail", -1, 1)
    p$error <- rnorm(200, 3000, 800)
    p$sales <- 800*p$exp - 3000*p$penalty + p$error
Experience = Bad?
Suppose I’ve just been appointed Director of Sales for an international fasteners company.
I think experience matters when it comes to sales success, so I want to use it as part of my hiring criteria. But I have some data too, so I decide to look at the numbers first.
I start with a simple correlation to assess the relationship between years of previous sales experience and first-year sales for new hires at our company. The result is… -0.38!?!
So more previous sales experience is associated with fewer sales in the first year? Really?
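If you want to reproduce that number, the pooled correlation can be computed with base R's `cor()`. This is only a minimal sketch: it rebuilds the simulated data from the top of the post so the block runs on its own, and `pooled_r` is a name I picked for illustration.

```r
# Rebuild the simulated data from the top of the post
set.seed(42)
area  <- rep(c("retail", "commercial"), each = 100)
exp_r <- runif(n = 100, 1, 5)
exp_c <- runif(n = 100, 4, 8)
p <- data.frame(area = area, exp = c(exp_r, exp_c))
p$penalty <- ifelse(p$area == "retail", -1, 1)
p$error   <- rnorm(200, 3000, 800)
p$sales   <- 800 * p$exp - 3000 * p$penalty + p$error

# Pearson correlation across all 200 new hires, ignoring any grouping
pooled_r <- cor(p$exp, p$sales)
round(pooled_r, 2)
```

The sign is the point here: a negative value says that, across everyone lumped together, more experience goes with lower first-year sales.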
I’m skeptical. Let’s turn to a boxplot instead.
I do a median split on experience and compare the first-year sales numbers for those below and above the median years of previous sales experience.
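    library(ggplot2)
    library(wesanderson)

    rush_gr <- wes_palette(n = 5, name = "Rushmore1")
    # Fill each of the two boxes with its own palette color
    ggplot(p, aes(x = exp > median(exp), y = sales, fill = exp > median(exp))) +
      geom_boxplot(show.legend = FALSE) +
      scale_fill_manual(values = rush_gr[c(3, 5)])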
Same thing! Those with experience above the median actually have lower first-year sales than those below the median.
If we stopped there, we might conclude we should stop hiring based on previous sales experience or maybe even focus on recruiting people with LESS sales experience.
Because that’s what the data say and we just need to follow the data…right?
Plot Your Data
Remember the first rule of analytics: Plot your data.
Lo and behold, when we plot our data, we get some clarity on our initial results.
    ggplot(p, aes(x = exp, y = sales)) +
      geom_point() +
      ggtitle("Sales by Years of Experience") +
      xlab("Years of Experience") + ylab("Sales")
Plotting makes it clear that we have two distinct groups in our data.
For the first group on the upper left, we see a clear upward slope, with increasing sales experience associated with increasing first-year sales.
But the same thing holds true for our second group on the lower right…sales experience is also positively related to sales.
In sum, considering each group separately, we see a clear POSITIVE relationship between previous experience and first-year sales. But lump it all together and we get a NEGATIVE relationship.
This is the essence of Simpson’s Paradox: Combine your data and you get one thing, split your data and get another.
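To put numbers on that, here is a minimal sketch comparing the pooled correlation with the correlation inside each cluster. It assumes the clusters are already labeled, which the simulated data does through its `area` column; `within_r` and `pooled_r` are names I made up for illustration.

```r
# Rebuild the simulated data from the top of the post
set.seed(42)
area  <- rep(c("retail", "commercial"), each = 100)
exp_r <- runif(n = 100, 1, 5)
exp_c <- runif(n = 100, 4, 8)
p <- data.frame(area = area, exp = c(exp_r, exp_c))
p$penalty <- ifelse(p$area == "retail", -1, 1)
p$error   <- rnorm(200, 3000, 800)
p$sales   <- 800 * p$exp - 3000 * p$penalty + p$error

# Correlation between experience and sales inside each group
within_r <- sapply(split(p, p$area), function(d) cor(d$exp, d$sales))

# The same correlation with both groups lumped together
pooled_r <- cor(p$exp, p$sales)

round(c(within_r, pooled = pooled_r), 2)
```

If the paradox is present, the two within-group values come out positive while the pooled value is negative.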
All Is Revealed
Let’s dive a bit deeper and use some regressions to see what is happening.
First, the combined data with a single regression line.
    ggplot(p, aes(x = exp, y = sales)) +
      geom_point() +
      geom_smooth(method = "lm", color = "black", se = FALSE) +
      ggtitle("Sales by Years of Experience") +
      xlab("Years of Experience") + ylab("Sales")
The regression line has a negative slope, consistent with our earlier negative correlation.
But looking at the plot, we see a confound: the group with generally more experience also has lower first-year sales.
Our simple regression line has a negative slope because it’s being swamped by this group effect. Those on the lower right have both more experience and lower sales, which pulls the line downward and misrepresents the otherwise visually obvious positive relationship between experience and sales.
This is not the regression’s fault. It’s our fault because looking at people based ONLY on experience masks what is really happening in the data.
We get a totally different outcome analyzing these two groups separately.
    temp_color <- wes_palette(n = 5, name = "Rushmore1")
    ggplot(p, aes(x = exp, y = sales)) +
      geom_point(aes(color = area)) +
      # Combined fit: one color (black, as in the previous plot), dashed
      geom_smooth(method = "lm", color = "black", linetype = 2) +
      # Separate fit for each area
      geom_smooth(method = "lm", aes(color = area)) +
      scale_color_manual(values = temp_color[c(3, 5)]) +
      ggtitle("Sales by Years of Experience") +
      xlab("Years of Experience") + ylab("Sales")
Adding a separate regression line for each of the two groups, we can now clearly see the positive slopes for experience within each cluster; compare that with the dashed combined regression line.
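The slopes behind those lines can also be pulled out directly. This is a rough sketch, assuming the simulated data from the top of the post and plain `lm()` fits of sales on experience (the same kind of straight-line fit that `geom_smooth(method = "lm")` draws); `pooled_slope` and `group_slopes` are illustrative names, not part of the original analysis.

```r
# Rebuild the simulated data from the top of the post
set.seed(42)
area  <- rep(c("retail", "commercial"), each = 100)
exp_r <- runif(n = 100, 1, 5)
exp_c <- runif(n = 100, 4, 8)
p <- data.frame(area = area, exp = c(exp_r, exp_c))
p$penalty <- ifelse(p$area == "retail", -1, 1)
p$error   <- rnorm(200, 3000, 800)
p$sales   <- 800 * p$exp - 3000 * p$penalty + p$error

# Slope of sales on experience, ignoring area
pooled_slope <- coef(lm(sales ~ exp, data = p))[["exp"]]

# Slope of sales on experience, fitted separately within each area
group_slopes <- sapply(
  split(p, p$area),
  function(d) coef(lm(sales ~ exp, data = d))[["exp"]]
)

round(c(pooled = pooled_slope, group_slopes))
```

A sign flip between `pooled_slope` and every entry of `group_slopes` is exactly the pattern the plot shows.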
Under the hood, I created this difference in the simulated data with a third variable called “area”, which arbitrarily assigns each person’s previous experience to either commercial or retail sales.
With this simulated data and the third variable now identified, we would conclude the following:
- Experience is positively associated with first-year sales for both groups.
- Those from retail sales fare better than those from commercial in first-year sales.
For good measure, I’ll throw this area variable into a new regression model to see its impact on model fit, noting of course that dealing fully with groups in regression is beyond our present scope.
    m2 <- lm(sales ~ exp + area, data = p)
    summary(m2)
    ##
    ## Call:
    ## lm(formula = sales ~ exp + area, data = p)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -2146.38  -520.46    -0.81   478.17  2236.81
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   -28.13     295.03  -0.095    0.924
    ## exp           803.26      46.84  17.151   <2e-16 ***
    ## arearetail   5948.04     177.25  33.558   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 771.7 on 197 degrees of freedom
    ## Multiple R-squared:  0.873, Adjusted R-squared:  0.8717
    ## F-statistic: 676.9 on 2 and 197 DF,  p-value: < 2.2e-16
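Two things are worth noticing in that output. The `exp` estimate (803.26) sits close to the 800 used in the simulation, and the `arearetail` estimate (5948.04) sits close to the 6000 gap built in by adding 3000 for retail and subtracting 3000 for commercial. To see the impact on fit, one option is to compare against the experience-only model. The sketch below assumes the same simulated data; `m1` is a name I'm introducing for the experience-only model, and this is just one reasonable way to make the comparison.

```r
# Rebuild the simulated data from the top of the post
set.seed(42)
area  <- rep(c("retail", "commercial"), each = 100)
exp_r <- runif(n = 100, 1, 5)
exp_c <- runif(n = 100, 4, 8)
p <- data.frame(area = area, exp = c(exp_r, exp_c))
p$penalty <- ifelse(p$area == "retail", -1, 1)
p$error   <- rnorm(200, 3000, 800)
p$sales   <- 800 * p$exp - 3000 * p$penalty + p$error

m1 <- lm(sales ~ exp, data = p)         # experience only (hypothetical baseline)
m2 <- lm(sales ~ exp + area, data = p)  # experience plus area, as above

# Share of variance in sales explained by each model
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
round(r2, 3)

# F-test for whether adding area improves on the experience-only model
anova(m1, m2)
```

Note that the sign of the `exp` coefficient also differs between the two models: negative in `m1`, positive in `m2`.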
Simpson’s Paradox is a specific case of the third variable problem in which splitting your data can show one clear trend while combining your data can show you a completely different trend.
Today we saw how one variable (experience) was confounded with group membership (sales area). When we combined our data, ignored groups, and blindly looked at correlation and a regression model, we got one result.
When we plotted our data, observed the presence of a grouping variable, and separated our analyses into groups, we got the opposite effect.
The big lessons?
- Plot your data
- Consider third variables: “What else might be true?”
- Be on the lookout for Simpson’s Paradox
In Part 2 of our series on Simpson’s Paradox coming up, we’ll see how huge interactions can hide as non-effects and how a seemingly clear performance advantage in two different time periods can be reversed by combining data.
- Paul Vanderlaken’s great post on Simpson’s Paradox. I didn’t see this one until after I started writing this post, but I wanted to mention it because it’s really good. Paul has tons of great stuff, so be sure to check him out.
- Kievit et al’s thorough but more academic treatment of Simpson’s Paradox
© 2021 HR Analytics 101