Predictive HR Analytics: How to Build a Basic Logistic Regression Model



Overview

In our previous post, we explained the basics of logistic regression: what it is, what it does, and why you should care.

In today’s tutorial, we step you through an example logistic regression in R so you can set up your own analyses, understand the output, and start down the path of creating basic predictive models.

For our more experienced readers, please note we are focusing on just the basics of logistic regression so we won’t complicate the lesson with things like time windows or separate training and test sets. I cover those topics in depth in my extended tutorial.

The dummy dataset we will use can be downloaded here.

Loading Data

I’ll start by loading some libraries, like dplyr from the tidyverse, to make our work easier and then load our data.

I assigned the data to d because short names mean less typing.

Note: If you aren’t familiar with the tidy packages and philosophy, it’s worth getting familiar with them. It pays off in time saved.

### Load libraries
### for magrittr piping, dplyr, ggplot, etc. 
### if you have trouble installing tidyverse within Rstudio
### try from within the RGui instead. 
library(tidyverse)

d <- readr::read_csv("data/Sim_Turnover_Data_HR_Analytics_101_CSV (1).csv")

Variable Selection

Our first step is to take a look at the data and get a feel for what variables might matter.

Some (like age and gender) should always be part of your exploration and analysis, while others may emerge after your descriptive analysis.

The key idea is that you want to approach your modeling with some general understanding of your data and reasonably related factors. If you skip this step, you will end up throwing a lot of garbage into the model and you’ll get a lot of garbage out.

Our goal is to model the likelihood of a voluntary departure so we’ll keep things simple and try to identify which of our variables are correlated with the outcome.

The vol_leave outcome variable is coded as 0/1 so we’ll look at the proportions of quitting against the other variables.
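
A quick structural look at the data and a check of the outcome coding is a good way to start; here’s a minimal sketch (it assumes the dummy dataset’s column names, like vol_leave, match what we use below):

### quick look at column names, types, and sample values
dplyr::glimpse(d)

### confirm the outcome is coded 0/1 and see the raw counts
table(d$vol_leave)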

Let’s first look at the roles we have.

table(d$role) %>% barplot(border = NA, main = "Roles in Dataset", col = "#2334A6")

We have a ton of individual contributors and managers and only a handful of the more senior roles.

The career dynamics for those senior positions are likely quite different, so let’s drop them from the current analysis using the filter function from dplyr and assign the filtered data to a new dataframe, d2. If you want to look at turnover for Director-and-up positions, do that as a separate analysis.

### filter down to just ind contr and managers and assign to new dataframe

d2 <- d %>% dplyr::filter(role %in% c("Ind", "Manager"))

Now we can take a look at the proportions of departures by some of our other variables.

We’ll start with a male/female comparison using the table and prop.table functions along with the pipe operator %>% from the magrittr package.

table(d2$sex, d2$vol_leave) %>% prop.table(1) %>% round(2)
##         
##             0    1
##   Female 0.53 0.47
##   Male   0.72 0.28

Females have a voluntary departure rate of 47% versus 28% for males, a major difference. We’ll be sure to keep that as a predictor.
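
If you want a quick statistical check that a gap this size is unlikely to be due to chance (optional for our purposes here), a chi-square test on the same table works:

### chi-square test of independence between sex and vol_leave
chisq.test(table(d2$sex, d2$vol_leave))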

Now let’s see if we have any evidence of differences based on the business area.

To do this, we’ll first split by business area and then compare the departure percentages across areas, similar to what we did above.

We could use the table command but let’s try another way using some dplyr functions instead.

d2 %>% group_by(area) %>%
      dplyr::summarize(n = n(), perc_vol_leave = sum(vol_leave)/n) %>% 
      dplyr::mutate(perc_vol_leave = round(perc_vol_leave, 2))
## # A tibble: 5 x 3
##   area           n perc_vol_leave
##   <chr>      <int>          <dbl>
## 1 Accounting  1595          0.3  
## 2 Finance     1662          0.31 
## 3 Marketing   2233          0.28 
## 4 Other       2173          0.31 
## 5 Sales       3337          0.570

Huge difference for those in sales too.

We’ll keep that one as well but let’s simplify that a little. Instead of creating a whole set of dummy variables for the different areas, we’ll make a new, single variable that codes whether someone is in Sales (1) or not in Sales (0).

d2$sales <- ifelse(d2$area  == "Sales", 1, 0)
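
A quick cross-tab is a good sanity check that the new variable lines up with the original area column:

### every Sales row should be a 1, everything else a 0
table(d2$area, d2$sales)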

Time to make some models.

Model 0: The Null Model

We begin with an empty model (or the “null” model) which is a model with no predictors, just the intercept. Understanding the results of a model with no predictors first reinforces the basic concepts of the logistic model and will be key to understanding our later models with predictors.

R Code for the Null Model

  • We use the function glm, which stands for generalized linear model.
  • We are “predicting” vol_leave here but without any predictors yet. We write that using the formula vol_leave ~ 1 where the ~1 just tells R that we only want the intercept. Note: In the later models we’ll replace the ~1 with predictor variables.
  • We set data = d2 to tell R that our data is in the d2 object (the dataset with just managers and individual contributors).
  • We set family = "binomial" to tell R that we want a logistic regression (i.e., using the logit link function).
  • We’ll store the new model result in m0 (for the null model) although we could name it whatever we want.
m0 <- glm(formula = vol_leave ~ 1, data = d2,  family = 'binomial')
summary(m0)
## 
## Call:
## glm(formula = vol_leave ~ 1, family = "binomial", data = d2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9818  -0.9818  -0.9818   1.3865   1.3865  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.47914    0.01962  -24.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14636  on 10999  degrees of freedom
## Residual deviance: 14636  on 10999  degrees of freedom
## AIC: 14638
## 
## Number of Fisher Scoring iterations: 4

Interpreting the Null Model

We’ll focus on the intercept, which is our first (and only) coefficient in the null model. We’ll ignore the significance indicator here because it’s not really meaningful for the intercept, but it will be critical for interpreting the predictors in the next models.

Our intercept value is -.48. We can get the value using coef(m0)[1], which extracts the first coefficient from our null model m0.

Now let’s map out the relationship of that intercept to the probability of a voluntary departure.

You’ll remember from the previous post that the sum of all of the stuff on the right part of the equation is equal to the log of the odds.

\[ log(\frac{p}{1-p}) = x \] where \(p\) is the probability of getting a 1 and \(1-p\) is the probability of getting 0.

This equation means we can use the value of \(x\) (which is just the intercept in the null model) to solve for the probability \(p\).

I’ve included the math for solving for p given the total value of x at the end of this post, but the punchline is that \[ p = \frac{e^x}{1 + e^x} \]

Let’s create a function that does that work for us so we can easily calculate the probability of getting a “1” given our results for x.

invlogit <- function(x) {exp(x)/(1+exp(x))}
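
As an aside, base R ships the same calculation as plogis() (the logistic distribution function), so the two are interchangeable; we’ll stick with our own invlogit() to keep the math visible.

### sanity check: our function matches the built-in
stopifnot(all.equal(invlogit(0.5), plogis(0.5)))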

In the null model, our value for x for every person is just the intercept. Let’s plug in that intercept value and see what it returns.

x <- coef(m0)[1] %>% as.numeric() # extract the intercept value
invlogit(x)
## [1] 0.3824545

The result is .38 which turns out to be the same as the mean of the vol_leave variable.

mean(d2$vol_leave)
## [1] 0.3824545

This makes sense because if we know nothing about someone at our company and want to estimate the probability of that person leaving within the time window represented by our data, our best guess would be the mean outcome, that is, the overall probability.

Time to add a predictor!

Model 1: Male/Female

We saw above that there was a big difference between males and females, so let’s start there.

We’ll use the same code set up as the null model but now replace ~ 1 with ~ sex, giving us a formula of vol_leave ~ sex.

In words, we are regressing voluntary departures on the variable of sex to see if that variable can help predict the likelihood of departure.

m1 <- glm(formula = vol_leave ~ sex, data = d2,  family = 'binomial')
summary(m1)
## 
## Call:
## glm(formula = vol_leave ~ sex, family = "binomial", data = d2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1243  -1.1243  -0.8084   1.2315   1.5985  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.12623    0.02584  -4.884 1.04e-06 ***
## sexMale     -0.82457    0.04081 -20.206  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14636  on 10999  degrees of freedom
## Residual deviance: 14214  on 10998  degrees of freedom
## AIC: 14218
## 
## Number of Fisher Scoring iterations: 4

Interpreting Model 1

In addition to our intercept, we now have a coefficient for the impact of male v. female. We only have two values here for sex, and R treated Female as the reference category (coded 0) and Male as 1; we know this because it put Male next to the coefficient label. By default R picks the reference level alphabetically, and the choice has no impact on the fitted probabilities, only on the sign and interpretation of the coefficient.
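
If you ever want Male as the reference level instead, relevel() will do it. This is purely cosmetic, so we won’t apply it here (the rest of the post assumes Female as the reference); a sketch on a copy of the data:

### example only: flip the reference level on a copy of the data
d_alt <- d2
d_alt$sex <- relevel(factor(d_alt$sex), ref = "Male")
### refitting glm(vol_leave ~ sex, ...) on d_alt would label the coefficient
### sexFemale and flip its sign, but the fitted probabilities would match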

We also see that our p-value for that predictor is <2e-16, way lower than the p < .05 standard cutoff for statistical significance. Bottom line, this predictor matters.

We can see what it’s telling us by comparing the estimated probability of leaving for males v. females.

Let’s first get the total value of the right-hand side of our regression equation for males and then plug that value into our invlogit function to get our probability outcome.

To do this, we add the value of the intercept plus the value of the predictor for males (1*-.82). The key here is that we are multiplying that coefficient by 1 because males were coded as 1s in the model.

## again using as.numeric to drop the label that R was holding onto in the output
invlogit(coef(m1)[1] + 1*coef(m1)[2]) %>% as.numeric()
## [1] 0.2787247

The result tells us that males would be expected to leave with a probability of .28.

Now let’s plug in the model values for females. We’ll again start with the intercept of -.13. However, females were coded as 0 for the second coefficient. This means we need to multiply that second coefficient by 0, so we end up with -.13 + 0*-.82 = -.13.

invlogit(coef(m1)[1] + 0*coef(m1)[2]) %>% as.numeric()
## [1] 0.4684849

According to the model, the probability of leaving is .47 for females, much higher.

Not surprisingly, these values turn out to be essentially the same as the base leave probabilities for males and females.

d2 %>% group_by(sex) %>% summarize(n = n(), vol_leave = sum(vol_leave)/ n)
## # A tibble: 2 x 3
##   sex        n vol_leave
##   <chr>  <int>     <dbl>
## 1 Female  6013     0.468
## 2 Male    4987     0.279

Again, with only one categorical predictor in the model we haven’t gained much from logistic regression yet, but we are reinforcing our understanding of how to link the coefficients to probabilities.

Remember the basic steps (there’s a shortcut sketched in code right after this list):

  • Generate the model

  • Multiply the coefficients in the model by corresponding variable values of interest

  • Sum those up

  • Run our inverted logit function on that sum to generate the probability outcome
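
In practice, R’s predict() function performs all of these steps for us when we ask for type = "response"; here’s a quick sketch confirming it matches our hand calculations with m1:

### predicted leave probabilities for a male and a female, straight from the model
predict(m1, newdata = data.frame(sex = c("Male", "Female")), type = "response")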

Now we can start adding more predictors to see how using logistic regression pays off.

Model 2: Sales

Let’s add our sales variable to the mix. All we need to do is add + sales to the formula.

m2 <- glm(formula = vol_leave ~ sex + sales, data = d2,  family = 'binomial')
summary(m2)
## 
## Call:
## glm(formula = vol_leave ~ sex + sales, family = "binomial", data = d2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4847  -0.9808  -0.6731   1.2586   1.7866  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.48189    0.02970  -16.23   <2e-16 ***
## sexMale     -0.88748    0.04254  -20.86   <2e-16 ***
## sales        1.18043    0.04420   26.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14636  on 10999  degrees of freedom
## Residual deviance: 13478  on 10997  degrees of freedom
## AIC: 13484
## 
## Number of Fisher Scoring iterations: 4

Interpreting Model 2

Our sales predictor is statistically significant and strongly positive, indicating that being in sales substantially increases the likelihood of leaving. In addition, we start to see the value of the logistic approach because we can consider the impact of multiple variables at the same time.

Observe too that the male/female predictor is still significant and the coefficient is close to its value from model 1. Informally, this means that introducing the sales variable didn’t have a huge impact on that predictor. In short, adding the sales predictor didn’t destabilize our interpretation of the other variable.

How big of an impact does sales have on the probability of leaving?

We’ll plug in some values, starting with females in sales v. not in sales.

Just as before, we’ll multiply the second coefficient by 0 because females were coded as 0.

We’ll also multiply the third coefficient by 1 because a “1” here means a person is in the sales group.

# females in sales
invlogit(coef(m2)[1] + 0*coef(m2)[2] + 1*(coef(m2)[3])) %>% as.numeric()
## [1] 0.6678645

The probability of a female in sales leaving is an incredibly high .67.

How does that compare for females not in sales? To see, we’ll now multiply the sales coefficient by 0.

# female not in sales
invlogit(coef(m2)[1] + 0*coef(m2)[2] + 0*(coef(m2)[3])) %>% as.numeric()
## [1] 0.3818065

The result is .38. That’s still high but being in sales had a major impact on the result.

Now we’ll do the same comparison for males.

# male in sales
invlogit(coef(m2)[1] + 1*coef(m2)[2] + 1*(coef(m2)[3])) %>% as.numeric()
## [1] 0.4529049
# male not in sales
invlogit(coef(m2)[1] + 1*coef(m2)[2] + 0*(coef(m2)[3])) %>% as.numeric()
## [1] 0.2027215

Again, a huge difference.

Exploring the Fitted Values

Now that we have multiple predictors, our estimated probability outcomes for each of our individuals in the data will be a little more interesting.

To do this, we’ll use the fitted.values output from our model.

head(m2$fitted.values)
##         1         2         3         4         5         6 
## 0.6678645 0.3818065 0.3818065 0.2027215 0.2027215 0.3818065

Each of these represents the estimated probability of leaving for that person (with the first value corresponding to our person in row 1, the second value for the person in row 2, etc.)
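
Because the fitted values come back in row order, we can attach them to the data for easier inspection; a quick sketch (it assumes no rows were dropped for missing values, so the lengths match):

### attach each person's estimated probability to the data
d2 %>%
      dplyr::mutate(est_prob = m2$fitted.values) %>%
      dplyr::select(sex, sales, vol_leave, est_prob) %>%
      head()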

We can also run some summary statistics and visualizations to better understand our results.

est_probs <- m2$fitted.values
hist(est_probs, breaks = 10)

quantile(est_probs) %>% round(2) #percentiles
##   0%  25%  50%  75% 100% 
## 0.20 0.20 0.38 0.45 0.67

You’ll note that the histogram is pretty chunky.

This is because we have only two predictors in our model and both are binary (male/female, sales/not sales), so our people can only fall into one of four groups.
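
We can check that directly: rounding the fitted values shows exactly four distinct probabilities, one for each sex-by-sales combination.

### four distinct estimated probabilities, one per group
table(round(m2$fitted.values, 3))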

If we add a continuous predictor to the mix, we should start to see some spreading out.

Model 3: Age

Before we add age into the model, we should first get a feel for how age might impact voluntary departures.

We’ll use the cut function to create four age bands and then calculate the proportion of departures for each band.

d2$age_cut <- cut(d2$age, breaks = 4)

d2 %>% group_by(age_cut) %>%
      summarize(n = n(), prob_depart = sum(vol_leave)/n)
## # A tibble: 4 x 3
##   age_cut         n prob_depart
##   <fct>       <int>       <dbl>
## 1 (22,31.1]    9405       0.379
## 2 (31.1,40.1]   886       0.354
## 3 (40.1,49.2]   600       0.457
## 4 (49.2,58.3]   109       0.495

We see a small dip in the second band, but the general picture is that the likelihood of leaving increases with age. This increase should therefore be reflected in our model results.

To add age to the model, we just add it to our formula.

m3 <- glm(formula = vol_leave ~ sex + sales + age, data = d2,  family = 'binomial')
summary(m3)
## 
## Call:
## glm(formula = vol_leave ~ sex + sales + age, family = "binomial", 
##     data = d2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5812  -0.9737  -0.6679   1.2610   1.8073  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.720920   0.102489  -7.034 2.01e-12 ***
## sexMale     -0.885640   0.042551 -20.814  < 2e-16 ***
## sales        1.180911   0.044216  26.708  < 2e-16 ***
## age          0.008633   0.003539   2.439   0.0147 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14636  on 10999  degrees of freedom
## Residual deviance: 13472  on 10996  degrees of freedom
## AIC: 13480
## 
## Number of Fisher Scoring iterations: 4

Interpreting Model 3

Our results show that the age predictor is also statistically significant with p = .0147.

The coefficient estimate looks small but remember that we’ll always multiply that coefficient by a person’s age to generate a probability outcome.

It made sense to talk about 0/1 for sales and 0/1 for male/female, but there is no such thing as zero age.

For a better reference point, we can multiply the age coefficient by the mean age of our sample.

This gives us a more meaningful value of .24 in the right-hand-side total.

mean(d2$age) * coef(m3)[4]
##       age 
## 0.2378295

Let’s plug in some values to get a sense of how age impacts the estimated probability of leaving.

# female in sales, average age
invlogit(coef(m3)[1] + 0*coef(m3)[2] + 1*(coef(m3)[3]) + mean(d2$age)*coef(m3)[4]) %>% as.numeric()
## [1] 0.6677044
# female in sales, age 50
invlogit(coef(m3)[1] + 0*coef(m3)[2] + 1*(coef(m3)[3]) + 50*coef(m3)[4]) %>% as.numeric()
## [1] 0.7092316

In the case of females in sales, we get a probability increase of .04 when going from the mean age of 27 to 50. Not huge given the high probability of leaving for females in sales, but it’s something.

Now let’s look at the estimated probabilities for a second group, males not in sales.

Remember, we need to change the values of our variables to generate the right outcomes for this new comparison.

# male not in sales, average age
invlogit(coef(m3)[1] + 1*coef(m3)[2] + 0*(coef(m3)[3]) + mean(d2$age)*coef(m3)[4]) %>% as.numeric()
## [1] 0.202825
# male not in sales, age 50
invlogit(coef(m3)[1] + 1*coef(m3)[2] + 0*(coef(m3)[3]) + 50*coef(m3)[4]) %>% as.numeric()
## [1] 0.2359711

Here, we see a difference of about .03. Again, not huge but note that this increase is substantial relative to the base probability of .20 for those of average age.

Indeed, the shift from .20 to .23 is about a 15% increase in the likelihood of leaving.
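
Rather than plugging in each combination by hand, we can also hand predict() a small grid of hypothetical people and get all the estimated probabilities at once. A sketch, with illustrative ages:

### estimated leave probability for every sex x sales x age combination of interest
new_people <- expand.grid(sex = c("Female", "Male"), sales = c(0, 1), age = c(25, 50))
new_people$est_prob <- predict(m3, newdata = new_people, type = "response")
new_people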

Finally, let’s take a look at our fitted values. We should see a substantially richer picture now that we have introduced a continuous variable.

est_probs <- m3$fitted.values
hist(est_probs)

hist(est_probs, breaks = 20)

quantile(est_probs) %>% round(2) #percentiles
##   0%  25%  50%  75% 100% 
## 0.20 0.21 0.38 0.45 0.72

Definitely more interesting than the previous view, but still pretty chunky.

Why? Don’t forget that we still have our two categorical values and they both have a huge impact on the final estimate. In real life, such large impacts are rare but they can pop up.

Adding age spreads those estimated probabilities out a bit, but male/female and sales/no sales are still the prime drivers in our dummy dataset.

We can get a better feel for this if we sort our fitted values and plot them.

plot(sort(m3$fitted.values), main = "Fitted Values from m3")