## A Note to the Reader

This post emerged from an original, more limited post about transition probabilities.

I wanted to share a bit more about how to work your data to calculate them…and then it just kept getting bigger.

I’m including the full post here but the first, basic portion of this post will appear in a smaller standalone set of posts about basic HR metrics. I just didn’t want to leave interested readers totally out on their own if they really wanted to dig into working with the data.

## Definition

Transition probability is the probability of someone in one role (or state) transitioning to another role (or state) within some fixed period of time.

The year is the typical unit of time but as with other metrics that depend on events with a lower frequency, I recommend you look at longer periods (e.g. 2 years) too.

## Description

We’ll get concrete and start with a group of 100 employees working at a call center as of Jan 1st. Over the course of the next 12 months we have the following changes:

• 27 quit the company
• 14 are “invited to flourish elsewhere”
• 9 are promoted
• 9 move to another role in the company
• 41 remain in their current role

Framed in terms of transition probabilities and given that we have 100 employees we would just say the following:

• .27 probability of quitting the company (27/100)
• .14 probability of being “invited to flourish elsewhere” (14/100)
• .09 probability of being promoted (9/100)
• .09 probability of moving to another role in the company (9/100)
• .41 probability of remaining in their current role (41/100)

The concept is that simple.
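If you want to see that arithmetic in R, here is a minimal sketch. The vector name and the labels are my own choices for illustration; the counts are the ones from the call center example.

```r
# Counts from the call center example (100 employees on Jan 1st).
# The vector name and element labels are illustrative only.
outcomes <- c(quit = 27, terminated = 14, promoted = 9,
              moved_role = 9, stayed = 41)

# Each transition probability is the count divided by the starting headcount
trans_prob <- outcomes / sum(outcomes)
print(trans_prob)

# The outcomes cover everyone we started with, so the probabilities sum to 1
stopifnot(isTRUE(all.equal(sum(trans_prob), 1)))
```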

As another example, if I start with 85 employees and then 10 leave within the next year, 7 are promoted, and the rest stay the same I would have the following transition probabilities:

• .12 probability of leaving (10/85)
• .08 probability of being promoted (7/85)
• .80 probability of staying in the same role (68/85)

Most HR organizations are probably not talking about transition probabilities, but they give you a clear advantage in understanding employee movement, especially when combined with some quality visualizations (see below).

If you want to get WAY into the math on this kind of thing check out Markov chains. It’s not necessary at all but some of you might like to scratch that mathematical itch.
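For the curious, here is a rough sketch of the Markov chain idea in R. Only the first row comes from the call center example (promotions and role moves lumped together, quits and terminations lumped together). The other two rows are assumptions I made up purely so the matrix is complete, and the whole thing leans on the Markov assumption that next year's move depends only on where you are now.

```r
# Simplified states for illustration: still in the original role,
# in some other internal role (promoted or moved), or gone from the company.
states <- c("in_role", "other_internal", "left")

P <- matrix(c(0.41, 0.18, 0.41,   # from the call center example
              0.00, 0.80, 0.20,   # ASSUMED values, purely illustrative
              0.00, 0.00, 1.00),  # ASSUMED: nobody who leaves comes back
            nrow = 3, byrow = TRUE,
            dimnames = list(states, states))

# Every row is a full set of outcomes for one starting state
stopifnot(isTRUE(all.equal(unname(rowSums(P)), c(1, 1, 1))))

# Under the Markov assumption, two-year probabilities are P multiplied by itself
P2 <- P %*% P
print(round(P2, 2))
```

Each row of `P2` still sums to 1; it just describes where people end up after two steps instead of one.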

## Why you should care

It forces you to see all the transitions in the same terms and at the same time.

There is no debating definitions.

Just cold, informative numbers.

## Example in R

The trick to this is really in the data manipulation. HR analysts work with many different data structures so I can’t account for each path, but I think the simplest way to do it is to take snapshots at the same point in two consecutive years (e.g. Jan 1st).

Once you start with that, you can bootstrap yourself into doing multiple periods.
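One way to sketch the multiple-period idea is to keep the yearly snapshots in a list and compare each one with the one that follows it. The snapshots and the `level_transitions` helper here are made up for illustration; the helper relies on the same left-join approach that the merge step later in this post walks through.

```r
# Toy snapshots, one per year (made-up data, same columns in each)
snaps <- list(
  `2018` = data.frame(emp_num = 1:4, level = c(1, 1, 2, 2)),
  `2019` = data.frame(emp_num = c(1, 2, 3, 5), level = c(1, 2, 2, 1)),
  `2020` = data.frame(emp_num = c(1, 2, 5), level = c(2, 2, 1))
)

# Hypothetical helper: transition probabilities between two snapshots
level_transitions <- function(start, end) {
  # Left join keeps everyone from the starting snapshot
  both <- merge(start, end, by = "emp_num", all.x = TRUE,
                suffixes = c(".start", ".end"))
  # Anyone missing from the ending snapshot is treated as a departure
  both$level.end[is.na(both$level.end)] <- "depart"
  # Row proportions: each starting level sums to 1
  prop.table(table(both$level.start, both$level.end), 1)
}

# Pair each snapshot with the next one: 2018 -> 2019, then 2019 -> 2020
periods <- Map(level_transitions, snaps[-length(snaps)], snaps[-1])
print(periods)
```

Each element of `periods` is one year-over-year transition table, so adding another year is just a matter of adding another snapshot to the list.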

We’ll begin with an example of 120 people in the call center to illustrate the final step in the calculations.

We create some data below with the following general characteristics:

• 120 people in Year 1
• 135 people in Year 2 (including some new hires)
• Some people in Year 1 stayed with the company (and are therefore also present in Year 2) while others left.

## Making the Data

I’ll show you step-by-step how we actually make these illustrative data for those playing at home (which I always recommend).

Of course, you will not need to create data when you are working with your own real HR data but reinforcing your R skills is never a bad idea.

Even if you don’t care about the creation of the data itself, pay attention to the merge/join step at the end. You will probably need to do something similar for your data manipulation.
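If the left join is new to you, here is a minimal sketch on made-up data before we get to the full example. The two tiny data frames and their names are assumptions for illustration only.

```r
# Two tiny made-up snapshots: employee 3 leaves, employee 4 is a new hire
snap_y1 <- data.frame(emp_num = 1:3, level = c(1, 1, 2))
snap_y2 <- data.frame(emp_num = c(1, 2, 4), level = c(1, 2, 1))

# all.x = TRUE keeps every Year 1 row, which makes this a left join
joined <- merge(snap_y1, snap_y2, by = "emp_num", all.x = TRUE,
                suffixes = c(".y1", ".y2"))
print(joined)

# Leavers show up as NA in the Year 2 columns; the new hire is dropped
left_company <- joined$emp_num[is.na(joined$level.y2)]
print(left_company)
```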

## Making Year 1 snapshot data
set.seed(42)
emp_num <- 1:120
level <- sample(c(1,2,3,4), 120, replace = T,  prob = c(.7, .2,.08, .02)) ## creating different role levels
dept <- rep('cust_rel', 120)
year <- rep(2018, 120)

y1 <- data.frame(emp_num, level, dept, year)

Now let’s make some data for our people in Year 2.

set.seed(23)
### starting with our year 1 data as a basis
y2 <- y1

### Randomly choosing 40 employees to leave:

leave_ind <- sample(1:120, 40)

### Dropping those rows because they left
y2 <- y2[-c(leave_ind),]

### updating values for the people who stayed
y2$year <- 2019
y2$dept <- sample(c('cust_rel', 'marketing', 'sales'), dim(y2)[1], prob = c(.7, .2, .1), replace = T)

### Some people move up, others stay on the same level:
## note: sample() has no size argument here, so each call returns a
## 2-element permutation that ifelse() recycles across the rows
y2$level <- ifelse(y2$level == 1, sample(c(1,2), prob = c(.7,.3)),
                   ifelse(y2$level == 2, sample(c(2,3), prob = c(.8, .2)),
                          ifelse(y2$level == 3, sample(c(3,4), prob = c(.8, .2)), y2$level)))

## creating new hires in our call center to append to our Year 2 set
## Simulating that we have new people in year 2
emp_num <- 121:175
level <- sample(c(1,2), 55, replace = T, prob = c(.7, .3)) ## creating different role levels
dept <- rep('cust_rel', 55)
year <- rep(2019, 55)

new_hires <- data.frame(emp_num, level, dept, year)

## adding on the new hires at the bottom of our current Year 2 group
# This forms our Year 2 snapshot data
y2 <- rbind(y2, new_hires)

Now let’s take a look at our two data snapshots.

library(knitr) # for nice table formatting
kable(head(y1))

| emp_num | level | dept | year |
|---|---|---|---|
| 1 | 3 | cust_rel | 2018 |
| 2 | 3 | cust_rel | 2018 |
| 3 | 1 | cust_rel | 2018 |
| 4 | 2 | cust_rel | 2018 |
| 5 | 1 | cust_rel | 2018 |
| 6 | 1 | cust_rel | 2018 |

kable(head(y2))

| emp_num | level | dept | year |
|---|---|---|---|
| 1 | 3 | marketing | 2019 |
| 2 | 4 | cust_rel | 2019 |
| 3 | 1 | cust_rel | 2019 |
| 4 | 3 | cust_rel | 2019 |
| 5 | 1 | marketing | 2019 |
| 6 | 2 | marketing | 2019 |

## showing that I have new hires in Year 2 data
kable(tail(y2))

| | emp_num | level | dept | year |
|---|---|---|---|---|
| 501 | 170 | 1 | cust_rel | 2019 |
| 511 | 171 | 2 | cust_rel | 2019 |
| 521 | 172 | 1 | cust_rel | 2019 |
| 531 | 173 | 1 | cust_rel | 2019 |
| 54 | 174 | 2 | cust_rel | 2019 |
| 551 | 175 | 2 | cust_rel | 2019 |

Please note that I added the new hires to the Year 2 data so you would actually have data to play with and learn how to handle them differently on your own. I’ll end up dropping them immediately as you will see below but it’s good to be aware of this step.

### Merging Year 1 and Year 2 Data

Now we need to bring our Year 1 and Year 2 data together to figure out the transition probabilities.

As part of that we also need to drop those people in Year 2 that were not with us in Year 1.

Why? Because we are focused on the transition, and people who were not employed at the time of our first snapshot cannot be part of that analysis, at least if we want to keep things simple.

In addition, we need to figure out a way to keep track of those who were in Year 1 but not in Year 2.
You could do this by tracking the date of departure but that makes things more complicated.

Fortunately we can accomplish what we need to in a single left join. This operation should be familiar to any SQL users out there. In R, we accomplish this with the merge function.

As a review, the goal of this merge/join is the following:

• Bring the Year 1 and Year 2 data together to create the transition probabilities
• Keep everyone from Year 1 even if they have no Year 2 data
• If they are still around we will see both the Year 1 and Year 2 status
• If they left, we’ll have blank information in our Year 2 columns which we can then use to identify departures
• Drop everyone that was not with us in Year 1

### Merging on the employee number
### Adding the suffixes
trans1 <- merge(y1, y2, by = 'emp_num', all.x = T, suffixes = c(".y1",".y2"))

Now let’s actually look at the data structure.

kable(head(trans1))

| emp_num | level.y1 | dept.y1 | year.y1 | level.y2 | dept.y2 | year.y2 |
|---|---|---|---|---|---|---|
| 1 | 3 | cust_rel | 2018 | 3 | marketing | 2019 |
| 2 | 3 | cust_rel | 2018 | 4 | cust_rel | 2019 |
| 3 | 1 | cust_rel | 2018 | 1 | cust_rel | 2019 |
| 4 | 2 | cust_rel | 2018 | 3 | cust_rel | 2019 |
| 5 | 1 | cust_rel | 2018 | 1 | marketing | 2019 |
| 6 | 1 | cust_rel | 2018 | 2 | marketing | 2019 |

From our dimension and summary inspection we can see that we dropped all of the new hires.

dim(trans1)

## [1] 120 7

summary(trans1)

##     emp_num          level.y1         dept.y1       year.y1
##  Min.   :  1.00   Min.   :1.000   cust_rel:120   Min.   :2018
##  1st Qu.: 30.75   1st Qu.:1.000                  1st Qu.:2018
##  Median : 60.50   Median :1.000                  Median :2018
##  Mean   : 60.50   Mean   :1.508                  Mean   :2018
##  3rd Qu.: 90.25   3rd Qu.:2.000                  3rd Qu.:2018
##  Max.   :120.00   Max.   :4.000                  Max.   :2018
##
##     level.y2        dept.y2             year.y2
##  Min.   :1.000   Length:120          Min.   :2019
##  1st Qu.:1.000   Class :character    1st Qu.:2019
##  Median :2.000   Mode  :character    Median :2019
##  Mean   :2.013                       Mean   :2019
##  3rd Qu.:2.000                       3rd Qu.:2019
##  Max.   :4.000                       Max.   :2019
##  NA's   :40                          NA's   :40

We can also see that we have a bunch of NAs.
When we did our left join, those employees were retained but the values are empty because they were not there at the beginning of Year 2. We’ll recode these NAs as ‘depart’ so we can identify those who left the company.

trans1[is.na(trans1)] <- 'depart'

## Calculating Transition Probabilities

There are many ways to handle this but I think the easiest path is to use the prop.table function.

Note for our Excel users that once you create the table you need (probably with some wrangling in SQL, R, etc.) you can use a pivot table to calculate the transition probabilities by displaying the values as a proportion.

We’ll start with the levels value but throw in the round function to make it easier to read.

kable(round(prop.table(table(trans1$level.y1, trans1$level.y2), 1), 2), row.names = T)

| | 1 | 2 | 3 | 4 | depart |
|---|---|---|---|---|---|
| 1 | 0.33 | 0.33 | 0.00 | 0.00 | 0.34 |
| 2 | 0.00 | 0.39 | 0.30 | 0.00 | 0.30 |
| 3 | 0.00 | 0.00 | 0.25 | 0.44 | 0.31 |
| 4 | 0.00 | 0.00 | 0.00 | 0.50 | 0.50 |

This table shows us that those starting at level 1 in Year 1 have a .33 probability of staying at level 1, a .33 probability of moving to level 2, and a .34 probability of leaving the company. Those starting at level 3 in Year 1 have a .25 probability of staying at the same level, a .44 probability of moving up, etc.

We can do the same kinds of calculations for our department transitions, although our analysis starts with a focus on customer relations so that is the only department in Year 1.

kable(round(prop.table(table(trans1$dept.y1, trans1$dept.y2), 1), 2), row.names = T)

| | cust_rel | depart | marketing | sales |
|---|---|---|---|---|
| cust_rel | 0.44 | 0.33 | 0.17 | 0.06 |

### Visualizing with Stacked Bar Graphs

One easy way to visualize the transition probabilities for the different levels is with stacked bar graphs.
library(ggplot2)
library(reshape2)

gg_df <- data.frame(round(prop.table(table(trans1$level.y1, trans1$level.y2), 1), 2))
names(gg_df) <- c('Level_Y1', 'Level_Y2', 'Prob')

### making a factor and using the new labels so they appear in the facet title
gg_df$Level_Y1 <- factor(gg_df$Level_Y1, levels = c(1,2,3,4),
                         labels = c('Year 1, Level 1','Year 1, Level 2', 'Year 1, Level 3', 'Year 1, Level 4' ))

ggplot(data = gg_df, aes(x= Level_Y2, y = Prob, fill = Level_Y2)) +
  geom_bar(stat = 'identity', show.legend = F) +
  facet_wrap(~Level_Y1, ncol = 1)+
  xlab('Year 2 Level') +
  scale_fill_manual(values=c('gray', 'gray', 'gray', 'gray', 'red3'))

### Visualizing with a Sankey Diagram

The Sankey diagram is a bit more interesting for our visualizations. It’s also a lot more work but it helps to see other ways to display data.

We’ll first create a dataframe from the table of transition counts. Then we’ll convert this information to a set of source locations (year 1 level) and target locations (year 2 level). This will be contained in our links dataframe.

We’ll also need a separate nodes dataframe to hold the names we want on the figure.

library(dplyr)
library(networkD3)
library(tidyr)
library(RColorBrewer)

# links<- data.frame(round(prop.table(table(trans1$level.y1, trans1$level.y2), 1), 2))
links<- data.frame(table(trans1$level.y1, trans1$level.y2))
names(links) <- c('source', 'target', 'value')

# We'll cast our source and targets as numerics, 'depart' here as '5'
ind <- which(links$target == 'depart')
links$source <- as.numeric(as.character(links$source))
links$target <- as.numeric(as.character(links$target)) # recoding as numeric; 'depart' becomes NA (with a warning) until we set it to 5
links$target[ind] <- 5 # add the five

### Links must be zero indexed so subtracting one from the source values
links$source <- links$source - 1

# we are also adding 3 to the target values.
# This will give us a total of 9 nodes (4 for the year 1 locations, 5 for year 2)
links$target <- links$target + 3

# Then we'll create nice names in the Sankey diagram

nodes <- data.frame(node = c(0:8), name  = c('Level 1 Y1', 'Level 2 Y1', 'Level 3 Y1', 'Level 4 Y1',
                                             'Level 1 Y2', 'Level 2 Y2', 'Level 3 Y2', 'Level 4 Y2', 'Departure'))

sankeyNetwork(Links = links, Nodes = nodes,
              Source = 'source',
              Target = 'target',
              Value = 'value',
              NodeID = 'name', fontSize = 12)
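The manual index arithmetic above is easy to get wrong, so here is one alternative sketch: build a label for each end of a link and look up its position in the node names with `match()`. The toy link counts and the `toy_links` name are made up for illustration, and this snippet only prepares the indices; it does not call networkD3.

```r
# Node names in display order (same names as the nodes dataframe above)
node_names <- c(paste('Level', 1:4, 'Y1'), paste('Level', 1:4, 'Y2'), 'Departure')

# Made-up link counts for illustration only
toy_links <- data.frame(from  = c(1, 1, 2),
                        to    = c('1', '2', 'depart'),
                        value = c(10, 5, 3))

# Build the label for each end of the link
source_label <- paste('Level', toy_links$from, 'Y1')
target_label <- ifelse(toy_links$to == 'depart', 'Departure',
                       paste('Level', toy_links$to, 'Y2'))

# match() is one-indexed, so subtract 1 to get zero-based indices
toy_links$source <- match(source_label, node_names) - 1
toy_links$target <- match(target_label, node_names) - 1
print(toy_links)
```

The resulting indices line up with the manual approach: Year 1 levels land on 0 to 3, Year 2 levels on 4 to 7, and departures on 8.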

## Final Thoughts

I’ve included a ton of code here if you want to work through the examples and get into the nitty-gritty of calculating transition probabilities. Regardless of how far you want to dig down into all of the details, remember the basic concept of tracking the probabilities of movement to a certain state/role over time.

Focusing on transition probabilities and comparing them across departments of your organization will help you frame employee movement in global analytic terms and see the bigger picture.
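As a parting sketch, here is roughly what a department comparison could look like. The data are made up for illustration (two departments, one row per employee with their outcome over the year), and the `moves` and `by_dept` names are my own.

```r
# Made-up outcomes for two departments (illustration only)
moves <- data.frame(
  dept    = c(rep('cust_rel', 10), rep('sales', 5)),
  outcome = c(rep('stay', 5), rep('promoted', 1), rep('depart', 4),
              rep('stay', 4), rep('depart', 1))
)

# The second argument (margin = 1) makes each department's row sum to 1,
# so the rows can be compared directly even though headcounts differ
by_dept <- prop.table(table(moves$dept, moves$outcome), 1)
print(round(by_dept, 2))
```

Reading across a row gives one department's transition probabilities; reading down a column compares the same transition across departments.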