Predictive HR Analytics: How to Build a Basic Logistic Regression Model
Overview
In our previous post, we explained the basics of logistic regression, what it is, what it does, and why you should care.
In today's tutorial, we step you through an example logistic regression in R so you can set up your own analyses, understand the output, and start down the path of creating basic predictive models.
For our more experienced readers, please note we are focusing on just the basics of logistic regression so we won’t complicate the lesson with things like time windows or separate training and test sets. I cover those topics in depth in my extended tutorial.
The dummy dataset we will use can be downloaded here.
Loading Data
I’ll start loading some libraries like dplyr from the tidyverse to make our work easier and then load our data.
I assigned the data to d
because short names mean less typing.
Note: If you aren't familiar with the tidy packages and philosophy, you should get familiar with the tidy packages and philosophy. It pays off in time saved.
### Load libraries
### for magrittr piping, dplyr, ggplot, etc.
### if you have trouble installing tidyverse within Rstudio
### try from within the RGui instead.
library(tidyverse)
d <- readr::read_csv("data/Sim_Turnover_Data_HR_Analytics_101_CSV (1).csv")
Variable Selection
Our first step is to take a look at the data and get a feel for what variables might matter.
Some (like age and gender) should always be part of your exploration and analysis, while others may emerge after your descriptive analysis.
The key idea is that you want to approach your modeling with some general understanding of your data and reasonably related factors. If you skip this step, you will end up throwing a lot of garbage into the model and you’ll get a lot of garbage out.
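Before looking at specific relationships, it helps to get a quick structural overview of what we loaded. A minimal sketch using dplyr's glimpse() (the exact columns and types will depend on the downloaded file):
### quick look at the columns and their types
dplyr::glimpse(d)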
Our goal is to model the likelihood of a voluntary departure so we’ll keep things simple and try to identify which of our variables are correlated with the outcome.
The vol_leave
outcome variable is coded as 0/1 so we’ll look at the proportions of quitting against the other variables.
Let’s first look at the roles we have.
table(d$role) %>% barplot(border = NA, main = "Roles in Dataset", col = "#2334A6")
We have a ton of individual contributors and managers and only a handful of the more senior roles.
The career dynamics for those senior positions are likely quite different, so let’s drop those from our current analysis using the filter
function from dplyr and assign that filtered data to a new dataframe, d2; if you want to look at turnover for Director and up positions, do that separately.
### filter down to just ind contr and managers and assign to new dataframe
d2 <- d %>% dplyr::filter(role %in% c("Ind", "Manager"))
Now we can take a look at the proportions of departures by some of our other variables.
We’ll start with a males/females comparison using the table
and prop.table
functions along with the pipe function %>%
from the magrittr package.
table(d2$sex, d2$vol_leave) %>% prop.table(1) %>% round(2)
##
## 0 1
## Female 0.53 0.47
## Male 0.72 0.28
Females have a voluntary departure rate of 47% vs. 28% for males, a major difference. We'll be sure to keep that as a predictor.
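If you want a quick formal check that a gap this size isn't just noise, a chi-squared test on the same two-way table is one option (a minimal sketch; the models below will tell the same story):
### quick significance check on the sex x departure table
chisq.test(table(d2$sex, d2$vol_leave))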
Now let’s see if we have any evidence of differences based on the business area.
To do this, we'll first split by business area and then compare the departure percentages across areas, similar to what we did above.
We could use the table
command but let’s try another way using some dplyr functions instead.
d2 %>% group_by(area) %>%
dplyr::summarize(n = n(), perc_vol_leave = sum(vol_leave)/n) %>%
dplyr::mutate(perc_vol_leave = round(perc_vol_leave, 2))
## # A tibble: 5 x 3
## area n perc_vol_leave
## <chr> <int> <dbl>
## 1 Accounting 1595 0.3
## 2 Finance 1662 0.31
## 3 Marketing 2233 0.28
## 4 Other 2173 0.31
## 5 Sales 3337 0.570
Huge difference for those in sales too.
We’ll keep that one as well but let’s simplify that a little. Instead of creating a whole set of dummy variables for the different areas, we’ll make a new, single variable that codes whether someone is in Sales (1) or not in Sales (0).
d2$sales <- ifelse(d2$area == "Sales", 1, 0)
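A quick cross-tab is a good sanity check that the recode did what we intended (every Sales row should be a 1, everything else a 0):
### confirm the new sales indicator lines up with the area variable
table(d2$area, d2$sales)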
Time to make some models.
Model 0: The Null Model
We begin with an empty model (or the “null” model) which is a model with no predictors, just the intercept. Understanding the results of a model with no predictors first reinforces the basic concepts of the logistic model and will be key to understanding our later models with predictors.
R Code for the Null Model
- We use the function glm, which stands for generalized linear model.
- We are "predicting" vol_leave here but without any predictors yet. We write that using the formula vol_leave ~ 1, where the ~ 1 just tells R that we only want the intercept. Note: In the later models we'll replace the ~ 1 with predictor variables.
- We set data = d2 to tell R that our data is in the d2 object (the dataset with just managers and individual contributors).
- We set family = "binomial" to tell R that we want a logistic regression (i.e. using the logit link function).
- We'll store the new model result in m0 (for the null model), although we could name it whatever we want.
m0 <- glm(formula = vol_leave ~ 1, data = d2, family = 'binomial')
summary(m0)
##
## Call:
## glm(formula = vol_leave ~ 1, family = "binomial", data = d2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9818 -0.9818 -0.9818 1.3865 1.3865
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.47914 0.01962 -24.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14636 on 10999 degrees of freedom
## Residual deviance: 14636 on 10999 degrees of freedom
## AIC: 14638
##
## Number of Fisher Scoring iterations: 4
Interpreting the Null Model
We’ll focus on the intercept which is our first (and only) coefficient in the null model. We’ll ignore the significance indicator for the intercept because it’s not really meaningful for the intercept but it’ll be critical for interpreting predictors in the next models.
Our intercept value is -.48. We can get the value using coef(m0)[1], which extracts the first coefficient from our null model m0.
Now let’s map out the relationship of that intercept to the probability of a voluntary departure.
You’ll remember from the previous post that the sum of all of the stuff on the right part of the equation is equal to the log of the odds.
\[ log(\frac{p}{1-p}) = x \] where \(p\) is the probability of getting a 1 and \(1-p\) is the probability of getting 0.
This equation means we can use the value of \(x\) (which is just the intercept in the null model) to solve for the probability \(p\).
I've included the math for solving for p given the total value of x at the end of this post, but the punchline is that \[ p = \frac{e^x}{1 + e^x} \]
Let's create a function that does that work for us so we can easily calculate the probability of getting a "1" given our results for x.
invlogit <- function(x) {exp(x)/(1+exp(x))}
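As an aside, base R already ships this function as plogis(), so you can sanity-check our helper against it if you like:
### plogis() is base R's built-in inverse logit; both calls should return the same value
invlogit(-0.48)
plogis(-0.48)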
In the null model, our value for x for every person is just the intercept value. Let’s plug in our intercept value and see what it returns
x <- coef(m0)[1] %>% as.numeric() # extract the intercept value
invlogit(x)
## [1] 0.3824545
The result is .38 which turns out to be the same as the mean of the vol_leave
variable.
mean(d2$vol_leave)
## [1] 0.3824545
This makes sense because if we know nothing about someone at our company and want to estimate the probability of that person leaving within the time window represented by our data, our best guess would be the mean outcome, that is, the overall probability.
Time to add a predictor!
Model 1: Male/Female
We saw above that there was a big difference between males and females, so let's start there.
We'll use the same code setup as the null model but now replace ~ 1 with ~ sex, giving us a formula of vol_leave ~ sex.
In words, we are regressing voluntary departures on the variable of sex to see if that variable can help predict the likelihood of departure.
m1 <- glm(formula = vol_leave ~ sex, data = d2, family = 'binomial')
summary(m1)
##
## Call:
## glm(formula = vol_leave ~ sex, family = "binomial", data = d2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1243 -1.1243 -0.8084 1.2315 1.5985
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.12623 0.02584 -4.884 1.04e-06 ***
## sexMale -0.82457 0.04081 -20.206 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14636 on 10999 degrees of freedom
## Residual deviance: 14214 on 10998 degrees of freedom
## AIC: 14218
##
## Number of Fisher Scoring iterations: 4
Interpreting Model 1
In addition to our intercept, we now have a coefficient for the impact of male v. female. We only have 2 values here for sex
so R went ahead and assigned Male a value of 1 and Female a value of 0; we know this because it put Male next to the coefficient label. (R picks the reference level alphabetically, so Female comes first.) Which level serves as the baseline has no impact on the ultimate results of the model.
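If you ever want to confirm how R coded a categorical predictor, you can inspect the contrasts directly (a quick sketch):
### shows which level is the reference (0) and which gets the dummy (1)
contrasts(factor(d2$sex))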
We also see that our p-value for that predictor is <2e-16, way lower than the p < .05
standard cutoff for statistical significance. Bottom line, this predictor matters.
We can see what it’s telling us by comparing the estimated probability of leaving for males v. females.
Let’s first get our total value of the right hand side of our regression equation for males and then plug in that value to our invlogit function to get our probability outcome.
To do this, we add the value of the intercept plus the value of the predictor for males (1*-.82). The key here is that we are multiplying that coefficient by 1 because males were coded as 1s in the model.
## again just adding as.numeric() to drop the label that R was holding onto in the output
invlogit(coef(m1)[1] + 1*coef(m1)[2]) %>% as.numeric()
## [1] 0.2787247
The result tells us that males would be expected to leave with a probability of .28.
Now let's plug in the model values for females. We'll again start with the intercept of -.13. However, females were coded as 0 for the second coefficient. This means we need to multiply that second coefficient by 0, so we end up with -.13 + 0*-.82 = -.13.
invlogit(coef(m1)[1] + 0*coef(m1)[2]) %>% as.numeric()
## [1] 0.4684849
According to the model, the probability of leaving is .47 for females, much much higher.
Not surprisingly, these values turn out to be essentially the same as the base leave probabilities for males and females.
d2 %>% group_by(sex) %>% summarize(n = n(), vol_leave = sum(vol_leave)/ n)
## # A tibble: 2 x 3
## sex n vol_leave
## <chr> <int> <dbl>
## 1 Female 6013 0.468
## 2 Male 4987 0.279
Again, with only one categorical predictor in our model we haven't gained much from logistic regression yet, but we are reinforcing our understanding of how we link the coefficients to probabilities.
Remember the basic steps:
1. Generate the model.
2. Multiply the coefficients in the model by the corresponding variable values of interest.
3. Sum those up.
4. Run our inverse logit function on that sum to generate the probability outcome.
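As a shortcut, R can handle steps 2 through 4 for us: predict() with type = "response" returns the probability directly. A minimal sketch using model m1 (the results should match our hand calculations above):
### predicted probability of leaving for a male and a female under m1
predict(m1, newdata = data.frame(sex = c("Male", "Female")), type = "response")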
Now we can start adding more predictors to see how using logistic regression pays off.
Model 2: Sales
Let’s add our sales variable to the mix. All we need to do is add + sales
to the formula.
m2 <- glm(formula = vol_leave ~ sex + sales, data = d2, family = 'binomial')
summary(m2)
##
## Call:
## glm(formula = vol_leave ~ sex + sales, family = "binomial", data = d2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4847 -0.9808 -0.6731 1.2586 1.7866
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.48189 0.02970 -16.23 <2e-16 ***
## sexMale -0.88748 0.04254 -20.86 <2e-16 ***
## sales 1.18043 0.04420 26.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14636 on 10999 degrees of freedom
## Residual deviance: 13478 on 10997 degrees of freedom
## AIC: 13484
##
## Number of Fisher Scoring iterations: 4
Interpreting Model 2
Our sales predictor is statistically significant and strongly positive, indicating that being in sales substantially increases the likelihood of leaving. In addition, we start to see the value of the logistic approach because we can consider the impact of multiple variables at the same time.
Observe too that the male/female predictor is still significant and the coefficient is close to its value from model 1. Informally, this means that the introduction of the sales variable didn't have a huge impact on that predictor. In short, adding the sales predictor didn't destabilize our interpretation of the other variable.
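Another common way to read these coefficients is to exponentiate them into odds ratios: values above 1 raise the odds of leaving and values below 1 lower them. A quick sketch:
### coefficients on the odds-ratio scale
exp(coef(m2)) %>% round(2)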
How big of an impact does sales have on the probability of leaving?
We’ll plug in some values, starting with females in sales v. not in sales.
Just as before, we’ll multiply the second coefficient by 0 because females were coded as 0.
We’ll also multiply the third coefficient by 1 because a “1” here means a person is in the sales group.
# females in sales
invlogit(coef(m2)[1] + 0*coef(m2)[2] + 1*(coef(m2)[3])) %>% as.numeric()
## [1] 0.6678645
The probability of a female in sales leaving is an incredibly high .67.
How does that compare for females not in sales? To see, we'll now multiply that sales coefficient by 0.
# female not in sales
invlogit(coef(m2)[1] + 0*coef(m2)[2] + 0*(coef(m2)[3])) %>% as.numeric()
## [1] 0.3818065
The result is .38. That’s still high but being in sales had a major impact on the result.
Now we’ll do the same comparison for males.
# male in sales
invlogit(coef(m2)[1] + 1*coef(m2)[2] + 1*(coef(m2)[3])) %>% as.numeric()
## [1] 0.4529049
# male not in sales
invlogit(coef(m2)[1] + 1*coef(m2)[2] + 0*(coef(m2)[3])) %>% as.numeric()
## [1] 0.2027215
Again, a huge difference.
Exploring the Fitted Values
Now that we have multiple predictors, our estimated probability outcomes for each of our individuals in the data will be a little more interesting.
To do this, we’ll use the fitted.values
output from our model
head(m2$fitted.values)
## 1 2 3 4 5 6
## 0.6678645 0.3818065 0.3818065 0.2027215 0.2027215 0.3818065
Each of these represents the estimated probability of leaving for that person (with the first value corresponding to our person in row 1, the second value for the person in row 2, etc.)
We can also run some summary statistics and visualizations to better understand our results
est_probs <- m2$fitted.values
hist(est_probs, breaks = 10)
quantile(est_probs) %>% round(2) #percentiles
## 0% 25% 50% 75% 100%
## 0.20 0.20 0.38 0.45 0.67
You’ll note that the histogram is pretty chunky.
This is because we have only two predictors in our model and both of those are binary (male/female, sales/not sales), so our people can only be in one of four groups.
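You can see those four groups directly by tabulating the distinct fitted values; you should get the same four probabilities we computed by hand (.20, .38, .45, and .67):
### count how many people fall at each distinct predicted probability
est_probs %>% round(2) %>% table()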
If we add a continuous predictor to the mix, we should start to see some spreading out.
Model 3: Age
Before we add age into the model, we should first get a feel for how age might impact voluntary departures.
We'll use the cut function to create four age bands and then calculate the proportion of departures for each of those bands.
d2$age_cut <- cut(d2$age, breaks = 4)
d2 %>% group_by(age_cut) %>%
summarize(n = n(), prob_depart = sum(vol_leave)/n)
## # A tibble: 4 x 3
## age_cut n prob_depart
## <fct> <int> <dbl>
## 1 (22,31.1] 9405 0.379
## 2 (31.1,40.1] 886 0.354
## 3 (40.1,49.2] 600 0.457
## 4 (49.2,58.3] 109 0.495
We see a slight dip in the second age band but the general picture is an increase in leaving with age. This increase should therefore be reflected in our model results.
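If you prefer a picture, a quick bar chart of those proportions makes the pattern easy to see (a sketch using ggplot2, which came along with the tidyverse):
### departure proportion by age band
d2 %>% group_by(age_cut) %>%
  summarize(prob_depart = sum(vol_leave)/n()) %>%
  ggplot(aes(x = age_cut, y = prob_depart)) +
  geom_col(fill = "#2334A6") +
  labs(title = "Departure Rate by Age Band", x = "Age band", y = "Proportion leaving")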
To add age to the model, we just add it to our formula.
m3 <- glm(formula = vol_leave ~ sex + sales + age, data = d2, family = 'binomial')
summary(m3)
##
## Call:
## glm(formula = vol_leave ~ sex + sales + age, family = "binomial",
## data = d2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5812 -0.9737 -0.6679 1.2610 1.8073
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.720920 0.102489 -7.034 2.01e-12 ***
## sexMale -0.885640 0.042551 -20.814 < 2e-16 ***
## sales 1.180911 0.044216 26.708 < 2e-16 ***
## age 0.008633 0.003539 2.439 0.0147 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14636 on 10999 degrees of freedom
## Residual deviance: 13472 on 10996 degrees of freedom
## AIC: 13480
##
## Number of Fisher Scoring iterations: 4
Interpreting Model 3
Our results show that the age predictor is also statistically significant with p = .0147.
The coefficient estimate looks small but remember that we’ll always multiply that coefficient by a person’s age to generate a probability outcome.
It made sense to talk about 0/1 for sales and 0/1 for males/ females but there is no such thing as zero age.
For a better reference point, we can multiply the age coefficient by the mean age of our sample.
This gives us a more meaningful value of about .24 for the age term in the right-side total.
mean(d2$age) * coef(m3)[4]
## age
## 0.2378295
Let’s plug in some values to get a sense of how age impacts the estimated probability of leaving.
# female in sales, average age
invlogit(coef(m3)[1] + 0*coef(m3)[2] + 1*(coef(m3)[3]) + mean(d2$age)*coef(m3)[4]) %>% as.numeric()
## [1] 0.6677044
# female in sales, age 50
invlogit(coef(m3)[1] + 0*coef(m3)[2] + 1*(coef(m3)[3]) + 50*coef(m3)[4]) %>% as.numeric()
## [1] 0.7092316
In the case of females in sales, we get a probability increase of .04 when going from the mean age of 27 to 50. Not huge given the high probability of leaving for females in sales, but it’s something.
Now let’s look at the estimated probabilities for a second group, males not in sales.
Remember, we need to change the values of our variables to generate the right outcomes for this new comparison.
# male not in sales, average age
invlogit(coef(m3)[1] + 1*coef(m3)[2] + 0*(coef(m3)[3]) + mean(d2$age)*coef(m3)[4]) %>% as.numeric()
## [1] 0.202825
# male not in sales, age 50
invlogit(coef(m3)[1] + 1*coef(m3)[2] + 0*(coef(m3)[3]) + 50*coef(m3)[4]) %>% as.numeric()
## [1] 0.2359711
Here, we see a difference of about .03. Again, not huge but note that this increase is substantial relative to the base probability of .20 for those of average age.
Indeed, the shift from .20 to .23 is about a 15% increase in the likelihood of leaving.
Finally, let’s take a look at our fitted values. We should see a substantially richer picture now that we have introduced a continuous variable.
est_probs <- m3$fitted.values
hist(est_probs)
hist(est_probs, breaks = 20)
quantile(est_probs) %>% round(2) #percentiles
## 0% 25% 50% 75% 100%
## 0.20 0.21 0.38 0.45 0.72
Definitely more interesting than the previous view, but still pretty chunky.
Why? Don’t forget that we still have our two categorical values and they both have a huge impact on the final estimate. In real life, such large impacts are rare but they can pop up.
Adding age spreads those estimated probabilities out a bit, but male/female and sales/no sales are still the prime drivers in our dummy dataset.
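One way to see this concretely is to average the fitted values within each sex-by-sales group (assuming no rows were dropped for missing values, so the fitted values line up with d2):
### mean predicted probability within each sex x sales group
d2 %>% mutate(est_prob = m3$fitted.values) %>%
  group_by(sex, sales) %>%
  summarize(n = n(), mean_prob = round(mean(est_prob), 2))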
We can get a better feel for this if we sort our fitted values and plot them.
plot(sort(m3$fitted.values), main = "Fitted Values from m3")