How to Avoid Aggregation Errors and Simpson’s Paradox In HR Analytics: Part 2
In our previous post we described the basic premise of Simpson’s Paradox: aggregate data and see one trend, separate your analyses and see another.
Today we’ll be short(er) and sweet, showing you a few more ways Simpson’s Paradox can pop up and lead you to misinterpret your data if you’re not careful.
As always, I encourage you to play along at home.
Interactions Disguised as No Effect
In this example, we ask whether a specific personality trait (say, extroversion) is related to job performance. As you will see, an initial, combined analysis shows no overall effect. When we take job type into account, however, the effect is huge.
The code for creating the data and figures for this first example is included below.
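library(ggplot2)  # needed for the plots that follow

set.seed(42)

# Extroversion scores for two groups of 200 employees
extro1 <- rnorm(200, 50, 10)
extro2 <- rnorm(200, 50, 10)

# Performance rises with extroversion in the first group and falls in the second
perf1 <- 25 + .5 * extro1 + rnorm(200, 0, 8)
perf2 <- 75 + -.5 * extro2 + rnorm(200, 0, 8)

# The first 200 rows (positive slope) are job2; the last 200 (negative slope) are job1
job <- rep(c("job2", "job1"), each = 200)

d <- data.frame(extro = c(extro1, extro2),
                perf = c(perf1, perf2),
                job_type = job)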
We’ll start with a basic scatter plot and add a linear regression to highlight the apparent lack of relationship between extroversion and job performance.
ggplot(d, aes(x = extro, y = perf)) +
  geom_point() +
  xlab(label = "Extroversion") +
  ylab(label = "Job Performance") +
  geom_smooth(method = "lm")
We have an apparently directionless blob of data and an essentially flat regression line. There’s a whole lotta nothin’ going on there.
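To put a number on that flat line, here is a minimal sketch that fits the same pooled linear model `geom_smooth(method = "lm")` draws and reports its slope and the raw correlation. The data-generation lines simply repeat the example so the block runs on its own; the names `pooled_fit`, `pooled_slope`, and `pooled_cor` are mine, chosen for illustration.

```r
# Recreate the example data (same seed and formulas as the post)
set.seed(42)
extro1 <- rnorm(200, 50, 10)
extro2 <- rnorm(200, 50, 10)
perf1 <- 25 + .5 * extro1 + rnorm(200, 0, 8)
perf2 <- 75 + -.5 * extro2 + rnorm(200, 0, 8)
job <- rep(c("job2", "job1"), each = 200)
d <- data.frame(extro = c(extro1, extro2),
                perf = c(perf1, perf2),
                job_type = job)

# Pooled model: job type is ignored entirely
pooled_fit <- lm(perf ~ extro, data = d)
pooled_slope <- unname(coef(pooled_fit)["extro"])
pooled_cor <- cor(d$extro, d$perf)

# Both values should sit close to zero
print(round(c(slope = pooled_slope, correlation = pooled_cor), 3))
```

A slope and correlation near zero are what you would expect here, because the two groups are the same size and their opposite trends cancel when pooled.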
But baked into our example data is a third factor: job type.
For the sake of the example, let’s say that job type 1 is a research-based role requiring solitary work and extended periods of intense focus. Bottom line, it offers and requires lots and lots of quiet.
In contrast, job type 2 is a client-facing role requiring regular interaction with multiple clients and multiple internal teams. Think lots and lots of people…calls, meetings, then more calls.
Personality, job role fit, and performance are of course a complicated business, so I am really simplifying here, but broadly it’s reasonable to think introverts might enjoy an advantage in a quiet research role while extroverts might perform better in a people-facing position.
When we add this third factor to our considerations, the picture changes considerably.
The first plot highlights those in job type 1 (research and solitude). There is clearly a negative relationship, with performance declining as extroversion increases.
ggplot(d, aes(x = extro, y = perf, color = job_type)) +
  geom_point() +
  xlab(label = "Extroversion") +
  ylab(label = "Job Performance") +
  scale_color_manual(values = c("darkgreen", "gray"))
In the second plot, the trend reverses with those in job type 2 (client-facing) showing increased performance with higher levels of extroversion.
ggplot(d, aes(x = extro, y = perf, color = job_type)) +
  geom_point() +
  xlab(label = "Extroversion") +
  ylab(label = "Job Performance") +
  scale_color_manual(values = c("gray", "darkred"))
The third plot adds separate regression lines to aid the eye and drives the point home.
ggplot(d, aes(x = extro, y = perf, color = job_type)) +
  geom_point() +
  xlab(label = "Extroversion") +
  ylab(label = "Job Performance") +
  scale_color_manual(values = c("darkgreen", "darkred")) +
  geom_smooth(method = "lm")
But aggregation is not bad or good per se. It’s just that in some cases it can bury context or hide interactions that meaningfully impact your interpretation. In this case, it was hiding a strong interaction in which the relationship between extroversion and job performance depended on the job type.
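If you want to go beyond eyeballing the plots, one way to check for this kind of interaction is sketched below. It fits a separate slope for each job type, then a single model with an `extro * job_type` interaction term. This is an illustration under the assumptions of the simulated data, not a prescription for real HR data; the names `slopes`, `int_fit`, and `interaction_coef` are my own.

```r
# Recreate the example data (same seed and formulas as the post)
set.seed(42)
extro1 <- rnorm(200, 50, 10)
extro2 <- rnorm(200, 50, 10)
perf1 <- 25 + .5 * extro1 + rnorm(200, 0, 8)
perf2 <- 75 + -.5 * extro2 + rnorm(200, 0, 8)
job <- rep(c("job2", "job1"), each = 200)
d <- data.frame(extro = c(extro1, extro2),
                perf = c(perf1, perf2),
                job_type = job)

# One regression per job type: returns a named vector of slopes
slopes <- sapply(split(d, d$job_type), function(g) {
  unname(coef(lm(perf ~ extro, data = g))["extro"])
})

# One model with an interaction: job1 is the reference level,
# so the interaction term is the job2 slope minus the job1 slope
int_fit <- lm(perf ~ extro * job_type, data = d)
interaction_coef <- unname(coef(int_fit)["extro:job_typejob2"])

print(round(slopes, 2))
print(round(interaction_coef, 2))
```

The interaction coefficient equals the difference between the two per-job slopes, so a value far from zero is the numerical counterpart of the crossing regression lines. `summary(int_fit)` gives the accompanying standard errors.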
The reverse can also be true: you can slice the data too finely and nuance everything to death and end up missing the bigger picture.
When the Loser Wins
This is a version of a batting averages example from Wikipedia.
We adapt it here to something more HR-like.
Let’s say we have two salespeople, Chris and Elka. The metric is the share of prospects they close on first contact. We compare them in Period 1, Period 2, and then overall.
In Period 1, Chris closes 12/48 for a closing rate of .250.
In the same period, Elka closes 104/411 for a success rate of .253. Edge to Elka.
In the second period, Elka again outperforms Chris, with a success rate of .321 vs. .314.
But when we combine the numbers, we get a different result, with Chris showing a better overall closing rate than Elka, .310 to .270. How is this possible?
| Person | Period 1 | Period 2 | Overall |
|---|---|---|---|
| Chris | 12/48 = .250 | 183/582 = .314 | 195/630 = .310 |
| Elka | 104/411 = .253 | 45/140 = .321 | 149/551 = .270 |
The issue is the number of observations per cell. When we divided our analysis into the two periods, Elka edged out Chris both times. But both salespeople closed at a higher rate in Period 2 than in Period 1, and their workloads were split very differently: 582 of Chris’s 630 contacts came in Period 2, while 411 of Elka’s 551 came in Period 1. An overall rate is a weighted average of the period rates, so Chris’s total is dominated by the high-rate period and Elka’s by the low-rate one. That is why Chris emerged as the winner when we combined all the data.
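The arithmetic is easy to reproduce. The sketch below uses the counts from the table; the data frame layout and the column names (`closed`, `contacts`, `share`) are my own choices for illustration. It computes the per-period rates, the pooled overall rates, and the share of each person's contacts that falls in each period.

```r
# Closed sales and first contacts, taken from the table
sales <- data.frame(
  person   = c("Chris", "Chris", "Elka", "Elka"),
  period   = c("Period 1", "Period 2", "Period 1", "Period 2"),
  closed   = c(12, 183, 104, 45),
  contacts = c(48, 582, 411, 140)
)
sales$rate <- sales$closed / sales$contacts

# Overall rate: pool the counts rather than averaging the two period rates
totals <- aggregate(cbind(closed, contacts) ~ person, data = sales, FUN = sum)
totals$rate <- totals$closed / totals$contacts

# Share of each person's contacts in each period (the weights in the overall rate)
sales$share <- sales$contacts / ave(sales$contacts, sales$person, FUN = sum)

print(sales)
print(totals)
```

The `share` column makes the imbalance visible at a glance: each overall rate is the sum of `rate * share` across that person's periods, so whichever period carries the most weight drives the total.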
That’s Simpson’s Paradox in a nutshell. Split the data and see one trend, combine the data and see another.
Today we provided two more examples of Simpson’s Paradox. But real life is more complicated than the illustrative cases here. The trick is to be thoughtful about when you split your data and when you aggregate. Be sure to check the number of observations if you collect data from different periods, groups, or levels of a variable. Think hard about any third variables that might be operating for one group versus another. Context, context, context.
It’s not that splitting is always better/worse or that aggregation is good/bad. Instead, remember the question you are asking and ask how your decisions to split or aggregate may impact what your results say.
No one is perfect and you can’t know everything, but being mindful of tricky statistical issues like Simpson’s Paradox means you’ll operate with your eyes wide open and increase your chances of success.
- © 2022 HR Analytics 101