How to Break Down Your Turnover Data
Breaking Down Turnover
In previous posts and video tutorials we covered how to calculate turnover from your raw HR data.
Today, I’m going show you some simple ideas and techniques for breaking down that turnover data to gain basic insights and spur internal conversation.
At a high level, it means focusing on just a few basic things like the following:
- Demographics
- Internal splits (e.g. role, area)
- Key categorical and continuous variables
We’ll first see what data we have and then we’ll look for patterns.
In short, today’s post is about finding basic relationships in the data, not about establishing causation.
Details on things like time windows and dates are skipped here to keep our focus.
Nonetheless, this analysis outline can be immediately applied profitably to your own data at work.
Libraries
Let’s first load some libraries to make our work easier.
If you don’t already have these installed, install them with the “packages” option in RStudio.
library(inspectdf) # exploratory tools
library(psych) # more stats
library(tidyverse) # all things tidy like dplyr, ggplot, etc.
Load Data
Import the simulated data here.
d <- readr::read_csv("data/Sim_Turnover_Data_HR_Analytics_101_CSV (1).csv")
Summary Inspection
Rule #1: Always look at your data first.
The head and summary functions are a good start.
head(d)
## # A tibble: 6 x 8
## role perf area sex id age salary vol_leave
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CEO 3 Sales Male 1 62 1000000 0
## 2 Director 3 Marketing Male 32 53.4 258935. 0
## 3 Director 1 Finance Male 76 53.5 189828. 1
## 4 Director 2 Sales Male 69 49.2 207492. 0
## 5 Director 3 Marketing Male 28 49.8 188205. 0
## 6 Director 1 Other Female 77 39.6 194837. 0
summary(d)
## role perf area sex
## Length:11111 Min. :1.000 Length:11111 Length:11111
## Class :character 1st Qu.:2.000 Class :character Class :character
## Mode :character Median :2.000 Mode :character Mode :character
## Mean :2.198
## 3rd Qu.:3.000
## Max. :3.000
## id age salary vol_leave
## Min. : 1 Min. :22.02 Min. : 42168 Min. :0.0000
## 1st Qu.: 2778 1st Qu.:24.07 1st Qu.: 57081 1st Qu.:0.0000
## Median : 5556 Median :25.70 Median : 60798 Median :0.0000
## Mean : 5556 Mean :27.79 Mean : 65358 Mean :0.3812
## 3rd Qu.: 8334 3rd Qu.:28.49 3rd Qu.: 64945 3rd Qu.:1.0000
## Max. :11111 Max. :62.00 Max. :1000000 Max. :1.0000
This gives us a basic overview of our data, telling us that we have information on the kinds of variables we need to get a picture of turnover in our company.
Cleaning and Filtering
Unfortunately the summary
function doesn’t tell us much about the categorical variables. Let’s create a few tables to take a peek.
table(d$role)
##
## CEO Director Ind Manager VP
## 1 100 10000 1000 10
table(d$perf)
##
## 1 2 3
## 1117 6675 3319
That’s a bit more helpful but doing this for each categorical variable is tedious.
Let’s use the %>%
piping operator with the map function to speed this up.
As part of this we’ll also recode voluntary departure from 0/1 into plain language and modify the performance variable to make the levels clearer.
# new variables for easier interpretation
d$vol_leave2 <- ifelse(d$vol_leave == 1, "depart", "stay") # relabeling
d$perf2 <- ifelse(d$perf == 1, "low", ifelse(d$perf == 2, "med", "high")) # relabeling
## converting perf2 to a factor so we can order the values as low-med-high
d$perf2 <- factor(d$perf2, levels = c("low", "med", "high"))
## Tables all of our character columns
d %>% dplyr::select(where(is.character)) %>% map(table)
## $role
##
## CEO Director Ind Manager VP
## 1 100 10000 1000 10
##
## $area
##
## Accounting Finance Marketing Other Sales
## 1609 1677 2258 2198 3369
##
## $sex
##
## Female Male
## 6068 5043
##
## $vol_leave2
##
## depart stay
## 4236 6875
If we want percentages instead of raw counts, we can add the prop.table
function which takes our table sums and returns proportions.
d %>% dplyr::select(where(is.character)) %>% map(table) %>% map(prop.table)
## $role
##
## CEO Director Ind Manager VP
## 9.00009e-05 9.00009e-03 9.00009e-01 9.00009e-02 9.00009e-04
##
## $area
##
## Accounting Finance Marketing Other Sales
## 0.1448114 0.1509315 0.2032220 0.1978220 0.3032130
##
## $sex
##
## Female Male
## 0.5461255 0.4538745
##
## $vol_leave2
##
## depart stay
## 0.3812438 0.6187562
We don’t have many high-level employees here and in the real world their turnover dynamics are likely quite different from the average worker.
Let’s filter them out and then take a another look at age and salary again.
d <- d %>% dplyr::filter(role %in% c("Manager", "Ind"))
describe(d$age)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 11000 27.55 5.78 25.67 26.28 2.87 22.02 58.26 36.23 2.24 4.99
## se
## X1 0.06
hist(d$age, col = 'red3', border = "white", xlab = "Age", main = "Age Distribution: All Employees")
describe(d$salary)
## vars n mean sd median trimmed mad min max
## X1 1 11000 63640.8 12642.44 60730.62 60988.28 5723.78 42168.22 134840
## range skew kurtosis se
## X1 92671.73 2.41 5.88 120.54
hist(d$salary, col = 'red3', border = "white", xlab = "Salary", main = "Salary Distribution: All Employees")
We note first that our age skews young. This makes sense given that we are dealing with many individual contributors who tend be in the earlier stages in their careers.
In addition, we seem to also have basically two different pieces in our salary distribution. I’m guessing managers make more.
boxplot(salary ~ role, data = d, col = "red3")
Good to know. Time to move on.
Breaking It Down
Continuous Variables
Let’s see how the picture of turnover changes by these other variables.
We’ll first look at departures broken down by salary and age using boxplots. The idea here is to see if our leavers look any different when it comes to salary and age.
boxplot(age ~ vol_leave2, data = d, col = "red3")
boxplot(salary ~ vol_leave2, data = d, col = "red3")
No screamingly obvious relationships here so at a quick glance, salary doesn’t appear to be an obvious driver of turnover.
But that doesn’t mean it’s not related at all. The salary differences for individuals v. managers could be obscuring the relationship if there is one.
Let’s break these down further by role. We’ll do this for age too.
boxplot(age ~ vol_leave2 + role, data = d, col = "red3")