How to Break Down Your Turnover Data

Breaking Down Turnover

In previous posts and video tutorials we covered how to calculate turnover from your raw HR data.

Today, I’m going show you some simple ideas and techniques for breaking down that turnover data to gain basic insights and spur internal conversation.

At a high level, it means focusing on just a few basic things like the following:

  • Demographics
  • Internal splits (e.g. role, area)
  • Key categorical and continuous variables

We’ll first see what data we have and then we’ll look for patterns.

In short, today’s post is about finding basic relationships in the data, not about establishing causation.

Details on things like time windows and dates are skipped here to keep our focus.

Nonetheless, this analysis outline can be immediately applied profitably to your own data at work.

Libraries

Let’s first load some libraries to make our work easier.

If you don’t already have these installed, install them with the “packages” option in RStudio.

library(inspectdf) # exploratory tools
library(psych) # more stats
library(tidyverse) # all things tidy like dplyr, ggplot, etc. 

Load Data

Import the simulated data here.

d <- readr::read_csv("data/Sim_Turnover_Data_HR_Analytics_101_CSV (1).csv")

Summary Inspection

Rule #1: Always look at your data first.

The head and summary functions are a good start.

head(d)
## # A tibble: 6 x 8
##   role      perf area      sex       id   age   salary vol_leave
##   <chr>    <dbl> <chr>     <chr>  <dbl> <dbl>    <dbl>     <dbl>
## 1 CEO          3 Sales     Male       1  62   1000000          0
## 2 Director     3 Marketing Male      32  53.4  258935.         0
## 3 Director     1 Finance   Male      76  53.5  189828.         1
## 4 Director     2 Sales     Male      69  49.2  207492.         0
## 5 Director     3 Marketing Male      28  49.8  188205.         0
## 6 Director     1 Other     Female    77  39.6  194837.         0
summary(d)
##      role                perf           area               sex           
##  Length:11111       Min.   :1.000   Length:11111       Length:11111      
##  Class :character   1st Qu.:2.000   Class :character   Class :character  
##  Mode  :character   Median :2.000   Mode  :character   Mode  :character  
##                     Mean   :2.198                                        
##                     3rd Qu.:3.000                                        
##                     Max.   :3.000                                        
##        id             age            salary          vol_leave     
##  Min.   :    1   Min.   :22.02   Min.   :  42168   Min.   :0.0000  
##  1st Qu.: 2778   1st Qu.:24.07   1st Qu.:  57081   1st Qu.:0.0000  
##  Median : 5556   Median :25.70   Median :  60798   Median :0.0000  
##  Mean   : 5556   Mean   :27.79   Mean   :  65358   Mean   :0.3812  
##  3rd Qu.: 8334   3rd Qu.:28.49   3rd Qu.:  64945   3rd Qu.:1.0000  
##  Max.   :11111   Max.   :62.00   Max.   :1000000   Max.   :1.0000

This gives us a basic overview of our data, telling us that we have information on the kinds of variables we need to get a picture of turnover in our company.

Cleaning and Filtering

Unfortunately the summary function doesn’t tell us much about the categorical variables. Let’s create a few tables to take a peek.

table(d$role)
## 
##      CEO Director      Ind  Manager       VP 
##        1      100    10000     1000       10
table(d$perf)
## 
##    1    2    3 
## 1117 6675 3319

That’s a bit more helpful but doing this for each categorical variable is tedious.

Let’s use the %>% piping operator with the map function to speed this up.

As part of this we’ll also recode voluntary departure from 0/1 into plain language and modify the performance variable to make the levels clearer.

# new variables for easier interpretation
d$vol_leave2 <- ifelse(d$vol_leave == 1, "depart", "stay") # relabeling
d$perf2 <- ifelse(d$perf == 1, "low", ifelse(d$perf == 2, "med", "high")) # relabeling

## converting perf2 to a factor so we can order the values as low-med-high
d$perf2 <- factor(d$perf2, levels = c("low", "med", "high"))

## Tables all of our character columns
d %>% dplyr::select(where(is.character)) %>% map(table)
## $role
## 
##      CEO Director      Ind  Manager       VP 
##        1      100    10000     1000       10 
## 
## $area
## 
## Accounting    Finance  Marketing      Other      Sales 
##       1609       1677       2258       2198       3369 
## 
## $sex
## 
## Female   Male 
##   6068   5043 
## 
## $vol_leave2
## 
## depart   stay 
##   4236   6875

If we want percentages instead of raw counts, we can add the prop.table function which takes our table sums and returns proportions.

d %>% dplyr::select(where(is.character)) %>% map(table) %>% map(prop.table)
## $role
## 
##         CEO    Director         Ind     Manager          VP 
## 9.00009e-05 9.00009e-03 9.00009e-01 9.00009e-02 9.00009e-04 
## 
## $area
## 
## Accounting    Finance  Marketing      Other      Sales 
##  0.1448114  0.1509315  0.2032220  0.1978220  0.3032130 
## 
## $sex
## 
##    Female      Male 
## 0.5461255 0.4538745 
## 
## $vol_leave2
## 
##    depart      stay 
## 0.3812438 0.6187562

We don’t have many high-level employees here and in the real world their turnover dynamics are likely quite different from the average worker.

Let’s filter them out and then take a another look at age and salary again.

d <- d %>% dplyr::filter(role %in% c("Manager", "Ind"))
describe(d$age)
##    vars     n  mean   sd median trimmed  mad   min   max range skew kurtosis
## X1    1 11000 27.55 5.78  25.67   26.28 2.87 22.02 58.26 36.23 2.24     4.99
##      se
## X1 0.06
hist(d$age, col = 'red3', border = "white", xlab = "Age", main = "Age Distribution: All Employees")

describe(d$salary)
##    vars     n    mean       sd   median  trimmed     mad      min    max
## X1    1 11000 63640.8 12642.44 60730.62 60988.28 5723.78 42168.22 134840
##       range skew kurtosis     se
## X1 92671.73 2.41     5.88 120.54
hist(d$salary, col = 'red3', border = "white", xlab = "Salary", main = "Salary Distribution: All Employees")


We note first that our age skews young. This makes sense given that we are dealing with many individual contributors who tend be in the earlier stages in their careers.

In addition, we seem to also have basically two different pieces in our salary distribution. I’m guessing managers make more.

boxplot(salary ~ role, data = d, col = "red3")


Good to know. Time to move on.

Breaking It Down

Continuous Variables

Let’s see how the picture of turnover changes by these other variables.

We’ll first look at departures broken down by salary and age using boxplots. The idea here is to see if our leavers look any different when it comes to salary and age.

boxplot(age ~ vol_leave2, data = d, col = "red3")

boxplot(salary ~ vol_leave2, data = d, col = "red3")


No screamingly obvious relationships here so at a quick glance, salary doesn’t appear to be an obvious driver of turnover.

But that doesn’t mean it’s not related at all. The salary differences for individuals v. managers could be obscuring the relationship if there is one.

Let’s break these down further by role. We’ll do this for age too.

boxplot(age ~ vol_leave2 + role, data = d, col = "red3")