How to Avoid Aggregation Errors and Simpson’s Paradox in HR Analytics: Part 2
In our previous post we described the basic premise of Simpson’s Paradox: aggregate data and see one trend, separate your analyses and see another.
Today we’ll be short(er) and sweet, showing you a few more ways Simpson’s Paradox can pop up and lead you to misinterpret your data if you’re not careful.
As always, I encourage you to play along at home.
Interactions Disguised as No Effect
In this example, we ask whether a specific personality trait (say, extroversion) is related to job performance. As you will see, an initial, combined analysis shows no overall effect. When we take job type into account, however, a strong relationship appears within each job, running in opposite directions.
The code for creating the data and figures for this first example is included below.
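# Simulate 200 employees per job type; a fixed seed makes the example reproducible
set.seed(42)
extro1 <- rnorm(200, 50, 10)   # extroversion scores, mean 50, SD 10
extro2 <- rnorm(200, 50, 10)
# Performance rises with extroversion in one group and falls in the other
perf1 <- 25 + 0.5 * extro1 + rnorm(200, 0, 8)
perf2 <- 75 - 0.5 * extro2 + rnorm(200, 0, 8)
# The first 200 rows (positive slope) are job2; the last 200 (negative slope) are job1
job <- rep(c("job2", "job1"), each = 200)
d <- data.frame(extro = c(extro1, extro2), perf = c(perf1, perf2),
                job_type = job)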
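Before plotting, it can help to check what the simulation implies. The sketch below rebuilds the same data and summarizes each job type; the name `group_means` is an illustrative choice, not part of the original code. Both groups are constructed to average about 50 on extroversion and on performance (25 + 0.5 * 50 and 75 - 0.5 * 50 both equal 50), so neither job sits above or below the other on average.

```r
# Rebuild the simulated data exactly as above
set.seed(42)
extro1 <- rnorm(200, 50, 10)
extro2 <- rnorm(200, 50, 10)
perf1 <- 25 + 0.5 * extro1 + rnorm(200, 0, 8)
perf2 <- 75 - 0.5 * extro2 + rnorm(200, 0, 8)
d <- data.frame(extro = c(extro1, extro2), perf = c(perf1, perf2),
                job_type = rep(c("job2", "job1"), each = 200))

# Mean extroversion and performance within each job type
# (group_means is a hypothetical name used only for this check)
group_means <- aggregate(cbind(extro, perf) ~ job_type, data = d, FUN = mean)
print(group_means)

# Number of simulated employees per job type
print(table(d$job_type))
```

Because the sample is random, the printed means should fall close to 50 rather than exactly on it.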
We’ll start with a basic scatter plot and overlay a linear regression line to highlight the apparent lack of relationship between extroversion and job performance.
library(ggplot2)  # needed for ggplot() and the geoms below
ggplot(d, aes(x = extro, y = perf)) + geom_point() +
  xlab(label = "Extroversion") + ylab(label = "Job Performance") +
  geom_smooth(method = "lm")
We have an apparently directionless blob of data and an essentially flat regression line. There’s a whole lotta nothin’ going on there.
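If you would rather see numbers than eyeball the plot, a pooled regression tells the same story. This is a minimal sketch that rebuilds the simulated data and fits one line to everyone; `pooled_fit`, `pooled_slope`, and `pooled_r` are illustrative names, not part of the original code.

```r
# Rebuild the simulated data exactly as above
set.seed(42)
extro1 <- rnorm(200, 50, 10)
extro2 <- rnorm(200, 50, 10)
perf1 <- 25 + 0.5 * extro1 + rnorm(200, 0, 8)
perf2 <- 75 - 0.5 * extro2 + rnorm(200, 0, 8)
d <- data.frame(extro = c(extro1, extro2), perf = c(perf1, perf2),
                job_type = rep(c("job2", "job1"), each = 200))

# One regression for all 400 employees, ignoring job type
pooled_fit <- lm(perf ~ extro, data = d)
pooled_slope <- coef(pooled_fit)[["extro"]]

# Pearson correlation between extroversion and performance, pooled
pooled_r <- cor(d$extro, d$perf)

print(round(c(slope = pooled_slope, r = pooled_r), 3))
```

Both values should come out close to zero, which is what the flat line in the plot reflects.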
But baked into our example data is a third factor: job type.
For the sake of the example, let’s say that job type 1 is a research-based role requiring solitary work and extended periods of intense focus. Bottom line, it offers and requires lots and lots of quiet.
In contrast, job type 2 is a client-facing role requiring regular interaction with multiple clients and multiple internal teams. Think lots and lots of people…calls, meetings, then more calls.
The interplay of personality, job-role fit, and performance is of course a complicated business, so I am really simplifying here, but broadly it’s reasonable to think introverts might enjoy an advantage in a quiet research role while extroverts might perform better in a people-facing position.
When we add this third factor to our considerations, the picture changes considerably.
The first plot highlights those in job type 1 (research and solitude). There is clearly a negative relationship, with performance declining as extroversion increases.
# Colors follow the alphabetical order of job_type: job1 = dark green, job2 = gray
ggplot(d, aes(x = extro, y = perf, color = job_type)) + geom_point() +
  xlab(label = "Extroversion") + ylab(label = "Job Performance") +
  scale_color_manual(values = c("darkgreen", "gray"))
In the second plot, the trend reverses with those in job type 2 (client-facing) showing increased performance with higher levels of extroversion.
# Same plot with the highlight swapped: job1 = gray, job2 = dark red
ggplot(d, aes(x = extro, y = perf, color = job_type)) + geom_point() +
  xlab(label = "Extroversion") + ylab(label = "Job Performance") +
  scale_color_manual(values = c("gray", "darkred"))
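One way to put numbers on the two opposing trends is a regression with an interaction term, which lets the extroversion slope differ by job type. This is a sketch under the same simulated data rather than a prescribed analysis; `fit`, `slope_job1`, and `slope_job2` are illustrative names, not part of the original code.

```r
# Rebuild the simulated data exactly as above
set.seed(42)
extro1 <- rnorm(200, 50, 10)
extro2 <- rnorm(200, 50, 10)
perf1 <- 25 + 0.5 * extro1 + rnorm(200, 0, 8)
perf2 <- 75 - 0.5 * extro2 + rnorm(200, 0, 8)
d <- data.frame(extro = c(extro1, extro2), perf = c(perf1, perf2),
                job_type = rep(c("job2", "job1"), each = 200))

# Make job1 the explicit reference level
d$job_type <- factor(d$job_type, levels = c("job1", "job2"))

# extro * job_type expands to both main effects plus their interaction
fit <- lm(perf ~ extro * job_type, data = d)
b <- coef(fit)

# Slope for the reference group (job1), and for job2 (reference + interaction)
slope_job1 <- b[["extro"]]
slope_job2 <- b[["extro"]] + b[["extro:job_typejob2"]]

print(round(c(job1 = slope_job1, job2 = slope_job2), 2))
```

The two slopes should land near the values used in the simulation, roughly -0.5 for the research role and 0.5 for the client-facing role. Averaged together they cancel, which is why the pooled analysis showed no effect.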