loader

How to Break Down Your Turnover Data

Breaking Down Turnover

In previous posts and video tutorials we covered how to calculate turnover from your raw HR data.

Today, I’m going show you some simple ideas and techniques for breaking down that turnover data to gain basic insights and spur internal conversation.

At a high level, it means focusing on just a few basic things like the following:

  • Demographics
  • Internal splits (e.g. role, area)
  • Key categorical and continuous variables

We’ll first see what data we have and then we’ll look for patterns.

In short, today’s post is about finding basic relationships in the data, not about establishing causation.

Details on things like time windows and dates are skipped here to keep our focus.

Nonetheless, this analysis outline can be immediately applied profitably to your own data at work.

Libraries

Let’s first load some libraries to make our work easier.

If you don’t already have these installed, install them with the “packages” option in RStudio.

library(inspectdf) # exploratory tools
library(psych) # more stats
library(tidyverse) # all things tidy like dplyr, ggplot, etc. 

Load Data

Import the simulated data here.

d <- readr::read_csv("data/Sim_Turnover_Data_HR_Analytics_101_CSV (1).csv")

Summary Inspection

Rule #1: Always look at your data first.

The head and summary functions are a good start.

head(d)
## # A tibble: 6 x 8
##   role      perf area      sex       id   age   salary vol_leave
##   <chr>    <dbl> <chr>     <chr>  <dbl> <dbl>    <dbl>     <dbl>
## 1 CEO          3 Sales     Male       1  62   1000000          0
## 2 Director     3 Marketing Male      32  53.4  258935.         0
## 3 Director     1 Finance   Male      76  53.5  189828.         1
## 4 Director     2 Sales     Male      69  49.2  207492.         0
## 5 Director     3 Marketing Male      28  49.8  188205.         0
## 6 Director     1 Other     Female    77  39.6  194837.         0
summary(d)
##      role                perf           area               sex           
##  Length:11111       Min.   :1.000   Length:11111       Length:11111      
##  Class :character   1st Qu.:2.000   Class :character   Class :character  
##  Mode  :character   Median :2.000   Mode  :character   Mode  :character  
##                     Mean   :2.198                                        
##                     3rd Qu.:3.000                                        
##                     Max.   :3.000                                        
##        id             age            salary          vol_leave     
##  Min.   :    1   Min.   :22.02   Min.   :  42168   Min.   :0.0000  
##  1st Qu.: 2778   1st Qu.:24.07   1st Qu.:  57081   1st Qu.:0.0000  
##  Median : 5556   Median :25.70   Median :  60798   Median :0.0000  
##  Mean   : 5556   Mean   :27.79   Mean   :  65358   Mean   :0.3812  
##  3rd Qu.: 8334   3rd Qu.:28.49   3rd Qu.:  64945   3rd Qu.:1.0000  
##  Max.   :11111   Max.   :62.00   Max.   :1000000   Max.   :1.0000

This gives us a basic overview of our data, telling us that we have information on the kinds of variables we need to get a picture of turnover in our company.

Cleaning and Filtering

Unfortunately the summary function doesn’t tell us much about the categorical variables. Let’s create a few tables to take a peek.

table(d$role)
## 
##      CEO Director      Ind  Manager       VP 
##        1      100    10000     1000       10
table(d$perf)
## 
##    1    2    3 
## 1117 6675 3319

That’s a bit more helpful but doing this for each categorical variable is tedious.

Let’s use the %>% piping operator with the map function to speed this up.

As part of this we’ll also recode voluntary departure from 0/1 into plain language and modify the performance variable to make the levels clearer.

# new variables for easier interpretation
d$vol_leave2 <- ifelse(d$vol_leave == 1, "depart", "stay") # relabeling
d$perf2 <- ifelse(d$perf == 1, "low", ifelse(d$perf == 2, "med", "high")) # relabeling

## converting perf2 to a factor so we can order the values as low-med-high
d$perf2 <- factor(d$perf2, levels = c("low", "med", "high"))

## Tables all of our character columns
d %>% dplyr::select(where(is.character)) %>% map(table)
## $role
## 
##      CEO Director      Ind  Manager       VP 
##        1      100    10000     1000       10 
## 
## $area
## 
## Accounting    Finance  Marketing      Other      Sales 
##       1609       1677       2258       2198       3369 
## 
## $sex
## 
## Female   Male 
##   6068   5043 
## 
## $vol_leave2
## 
## depart   stay 
##   4236   6875

If we want percentages instead of raw counts, we can add the prop.table function which takes our table sums and returns proportions.

d %>% dplyr::select(where(is.character)) %>% map(table) %>% map(prop.table)
## $role
## 
##         CEO    Director         Ind     Manager          VP 
## 9.00009e-05 9.00009e-03 9.00009e-01 9.00009e-02 9.00009e-04 
## 
## $area
## 
## Accounting    Finance  Marketing      Other      Sales 
##  0.1448114  0.1509315  0.2032220  0.1978220  0.3032130 
## 
## $sex
## 
##    Female      Male 
## 0.5461255 0.4538745 
## 
## $vol_leave2
## 
##    depart      stay 
## 0.3812438 0.6187562

We don’t have many high-level employees here and in the real world their turnover dynamics are likely quite different from the average worker.

Let’s filter them out and then take a another look at age and salary again.

d <- d %>% dplyr::filter(role %in% c("Manager", "Ind"))
describe(d$age)
##    vars     n  mean   sd median trimmed  mad   min   max range skew kurtosis
## X1    1 11000 27.55 5.78  25.67   26.28 2.87 22.02 58.26 36.23 2.24     4.99
##      se
## X1 0.06
hist(d$age, col = 'red3', border = "white", xlab = "Age", main = "Age Distribution: All Employees")

describe(d$salary)
##    vars     n    mean       sd   median  trimmed     mad      min    max
## X1    1 11000 63640.8 12642.44 60730.62 60988.28 5723.78 42168.22 134840
##       range skew kurtosis     se
## X1 92671.73 2.41     5.88 120.54
hist(d$salary, col = 'red3', border = "white", xlab = "Salary", main = "Salary Distribution: All Employees")


We note first that our age skews young. This makes sense given that we are dealing with many individual contributors who tend be in the earlier stages in their careers.

In addition, we seem to also have basically two different pieces in our salary distribution. I’m guessing managers make more.

boxplot(salary ~ role, data = d, col = "red3")


Good to know. Time to move on.

Breaking It Down

Continuous Variables

Let’s see how the picture of turnover changes by these other variables.

We’ll first look at departures broken down by salary and age using boxplots. The idea here is to see if our leavers look any different when it comes to salary and age.

boxplot(age ~ vol_leave2, data = d, col = "red3")

boxplot(salary ~ vol_leave2, data = d, col = "red3")


No screamingly obvious relationships here so at a quick glance, salary doesn’t appear to be an obvious driver of turnover.

But that doesn’t mean it’s not related at all. The salary differences for individuals v. managers could be obscuring the relationship if there is one.

Let’s break these down further by role. We’ll do this for age too.

boxplot(age ~ vol_leave2 + role, data = d, col = "red3")

boxplot(salary ~ vol_leave2 + role, data = d, col = "red3")

Not huge but there is a hint that those who earn a little more might be more likely to stay (reminder: we are just noting the possible relationship, NOT asserting a causal link).

This is the kind of thing we would want to remember if we were building formal models. Right now though we are just going for big picture.

Finally, we can be a little more exacting on the values if we run a summary function that splits on departure and role and returns means for salary and age in a table.

d %>% group_by(role, vol_leave2) %>% summarize_at(vars(age, salary), mean)
## # A tibble: 4 x 4
## # Groups:   role [2]
##   role    vol_leave2   age  salary
##   <chr>   <chr>      <dbl>   <dbl>
## 1 Ind     depart      25.9  58941.
## 2 Ind     stay        26.1  60746.
## 3 Manager depart      43.1  97426.
## 4 Manager stay        42.8 100986.

Again, the differences are not huge but something to note.

Categorical Variables

We’ve looked at the contiuous variables of age and salary.

Now we need to break down the proportion of departures by our categorical variables. We go through just a few so you can get the idea.

We’ll begin by constructing tables and again wrapping them in prop.table.

I’ll first do this using the base R method.

Then I’ll use dplyr and piping for the others although the outputs are basically the same.

Note that I need to include the 1 in the prop.table statement so it returns the proportion of leavers and stayers for each level of the categorical variables.

# Male v. Female
prop.table(table(d$sex, d$vol_leave2), 1)
##         
##             depart      stay
##   Female 0.4684849 0.5315151
##   Male   0.2787247 0.7212753
# performance
d %>% dplyr::select(perf2, vol_leave2) %>%
      table() %>% prop.table(1) %>% 
      round(2)
##       vol_leave2
## perf2  depart stay
##   low    0.34 0.66
##   med    0.34 0.66
##   high   0.49 0.51
#role
d %>% dplyr::select(role, vol_leave2) %>%
      table() %>% prop.table(1) %>% 
      round(2)
##          vol_leave2
## role      depart stay
##   Ind       0.37 0.63
##   Manager   0.46 0.54
# area
d %>% dplyr::select(area, vol_leave2) %>%
      table() %>% prop.table(1) %>% 
      round(2)
##             vol_leave2
## area         depart stay
##   Accounting   0.30 0.70
##   Finance      0.31 0.69
##   Marketing    0.28 0.72
##   Other        0.31 0.69
##   Sales        0.57 0.43


There are huge differences in all of these areas.

For example, we see that females are much more likely to leave than males. In the real world, we would want to investigate such differences immediately.

Similarly, we see that those in sales are also much more likely to leave. This could signal that people in sales are in high demand in the job market, that they are very unsatisfied with the department, some combination, or something else.

Finally, let’s take these same values but now create bar plots to make the results hit home.

# Male v. Female
prop.table(table(d$sex, d$vol_leave2), 1) %>% round(2) %>%
      as.data.frame() %>% 
      ggplot(aes(x = Var1, y = Freq, fill = Var2)) +
      geom_bar(stat = "identity") + scale_fill_hue(l=40)

# performance
d %>% dplyr::select(perf2, vol_leave2) %>%
      table() %>% prop.table(1) %>% 
      round(2) %>%
      as.data.frame() %>% 
      ggplot(aes(x = perf2, y = Freq, fill = vol_leave2)) +
      geom_bar(stat = "identity") + scale_fill_hue(l=40)

#role
d %>% dplyr::select(role, vol_leave2) %>%
      table() %>% prop.table(1) %>% 
      round(2) %>%
      as.data.frame() %>% 
      ggplot(aes(x = role, y = Freq, fill = vol_leave2)) +
      geom_bar(stat = "identity") + scale_fill_hue(l=40)

# area
d %>% dplyr::select(area, vol_leave2) %>%
      table() %>% prop.table(1) %>% 
      round(2) %>%
      as.data.frame() %>% 
      ggplot(aes(x = area, y = Freq, fill = vol_leave2)) +
      geom_bar(stat = "identity") + scale_fill_hue(l=40)

Summary

Our purpose here was to learn how to breakdown turnover data into pieces and gain insights.

Big picture, keep it simple first and think in terms of categorical v. continuous, of demographics, and of basic internal splits (e.g. area, role).

This will help you keep your analyses focused, specific, and concrete.

That’s it for today!

Have fun exploring your own turnover data and please feel free to reach out with any questions.

Addendum: Other Tools

I recently learned about the inspectdf package at a great resource called. Little Miss Data that I wanted to share.

We’ll use the inspect_cat function on two dataframes, one for managers and one for individuals.

Then we’ll do the same for males v. females and throw in some correlations that you could also consider for your own data.

We are just touching the tip of the iceberg here so feel free to explore and experiment on your own.

# Role: Departure only
mgmt <- d %>% dplyr::filter(role== "Manager") %>% dplyr::select(vol_leave2)
ind <- d %>% dplyr::filter(role == "Ind") %>% dplyr::select(vol_leave2)
inspect_cat(mgmt, ind) %>% show_plot()

# male/ female: Depart only

female <- d %>% dplyr::filter(sex== "Female") %>% dplyr::select(vol_leave2)
male <- d %>% dplyr::filter(sex == "Male") %>% dplyr::select(vol_leave2)
inspect_cat(female, male) %>% show_plot()