## How to Predict Categorical Outcomes: Logistic Regression Fundamentals

Logistic regression is an essential tool in your analytics toolkit. It’s great for basic predictive models like predicting employee turnover and also for investigating basic relationships in your data.

In today’s post we’ll tell you what logistic regression is, what it does, and why you should care.

The goal is to help you get started, not shoehorn a semester’s worth of graduate-level stats into a single post. Once you get rolling, you’ll have a better foundation for increasing your understanding. Remember, learn by doing.

In the tutorial follow-up to this post, we show you step-by-step how to do a basic logistic regression in R to predict employee turnover, including interpreting your results.

## What is Logistic Regression?

Logistic regression is a modeling method in which we use information from one or more variables to predict a binary outcome, that is, an outcome with only two possibilities (coded as 0/1 with 1 meaning the event occurred).

Everyday examples of binary outcomes in HR analytics include stay/depart, promoted/not promoted, or high-potential/not high-potential.

In practice we might use logistic regression to predict the probability of a person staying with the company (0) or leaving voluntarily (1) using variables like age, company tenure, and employee engagement as the predictors.
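As a small illustration of 0/1 coding, the sketch below turns a text status column into a binary outcome. The data frame and its column names (`status`, `left`) are made up for this example and are not taken from the sample data used later in this post.

```r
# Hypothetical toy data: names and values are invented for illustration
hr <- data.frame(
  employee = c("A", "B", "C", "D", "E"),
  status   = c("stayed", "left", "stayed", "stayed", "left")
)

# Code the outcome as 0/1, with 1 meaning the event (leaving) occurred
hr$left <- ifelse(hr$status == "left", 1, 0)

# Count how many 0s and 1s we have
table(hr$left)

# The mean of a 0/1 variable is the observed proportion of 1s
mean(hr$left)
```

Coding the event of interest as 1 is what lets us read the model's prediction as "the probability of leaving".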

## How Does Logistic Regression Work?

Logistic regression is part of a family of models in which input values (X) are combined linearly using weights to predict an outcome (Y).

These models have the general form of $$y = mx + b$$ that you might remember from high school or university.

The input values (X) are predictor variables such as age or engagement and are commonly referred to as independent variables.

We multiply each of these input variables by a unique weight (called a beta weight) and then add everything up to get our prediction for the outcome Y.

The thing we are trying to predict is called the dependent variable because its value “depends on” the independent predictor variables.

We can (very) roughly and informally summarize this as the following:

$\begin{equation} Predicted \ y = Intercept + Weight_1x_1 + Weight_2x_2… \end{equation}$

In standard linear regression we have a continuous range of outcomes (not a series of 0s and 1s), so we can simply add up everything on the right to get our predicted value directly.
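To make the weighted sum concrete, here is a minimal sketch with made-up numbers. The intercept, the weights, and the predictor values are all hypothetical; in practice the model estimates the weights for you.

```r
# Hypothetical intercept and beta weights (invented for illustration only)
intercept <- 2.0
weights   <- c(age = 0.05, tenure = -0.10)

# Hypothetical predictor values for one person, in the same order as the weights
x <- c(age = 40, tenure = 5)

# Multiply each input by its weight, add everything up, then add the intercept
predicted_y <- intercept + sum(weights * x)
predicted_y
```

Here the sum works out to 2 + (0.05 × 40) + (-0.10 × 5) = 3.5, which is perfectly fine as a prediction for a continuous outcome.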

With logistic regression, however, we need to take one extra step. Remember that here we have only 0s and 1s as outcomes but our goal is to predict the probability of the “1” outcome.

If we just added up everything on the right side of our equation we could end up getting values that fall outside of our required [0,1] probability range. We would also end up violating certain statistical assumptions about the distribution of our errors.
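A quick way to see the range problem is to fit an ordinary linear regression to a 0/1 outcome and look at its predictions. The data below are a deliberately simple, made-up example, and the variable names are placeholders.

```r
# Made-up predictor and a 0/1 outcome that switches from 0 to 1 as x increases
x <- seq(-4, 4, length.out = 200)
y <- as.numeric(x > 0)

# Ordinary linear regression on the 0/1 outcome
lin_fit  <- lm(y ~ x)
lin_pred <- predict(lin_fit)

# The fitted values run below 0 and above 1, so they cannot be probabilities
range(lin_pred)
```

The straight line keeps going at both ends of the predictor range, which is exactly why we need a transformation that flattens out near 0 and 1.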

What we need then is some kind of link function that transforms whatever sum we get on the right into a probability value between 0 and 1.

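It’s called the link function specifically because it uses some kind of transformation to link the linear combination of predictors on the right side with the outcome on the left side. (Strictly speaking, statisticians reserve the name for the transformation going the other way, from the probability to the linear sum; the function we use here is its inverse. The informal usage is fine for our purposes.)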

The general form of the link function is the following:

$\begin{equation} y_i = m(\beta_0 + \beta_1x_{1i} + \beta_2x_{2i}…) \end{equation}$

where $$m$$ represents the link function operating on our sum of linear inputs and $$y_i$$ represents the probability of the outcome for person $$i$$.

But what function do we need?

## The Logistic Function: Don’t Panic

This magic function is the logistic function:

$\begin{equation} \frac{e^x}{1+e^x} \end{equation}$

In logistic regression, the model results give us the beta weights $$\beta$$. We use them to compute the sum on the right-hand side of our equation, then plug that sum into the logistic function in place of $$x$$ to generate our prediction:

$\begin{equation} \frac{e^{\beta_0 + \beta_1x_1 + \beta_2x_2…}}{1 +e^{\beta_0 + \beta_1x_1 + \beta_2x_2…} } \end{equation}$

If you look carefully you’ll see that in this equation, we still have our series of input values and beta weights just as we did before in our linear equation above.

The top piece of the logistic function $$e^{\beta_0 + \beta_1x_1 + \beta_2x_2…}$$ gives us the odds of the event happening.

The bottom piece $$1 +e^{\beta_0 + \beta_1x_1 + \beta_2x_2…}$$ is just 1 + those odds.

Putting this all together, we have the following relationship and can generate the predicted probability $$p$$ of the outcome:

$\begin{equation} p = \frac{odds}{1+odds} = \frac{e^{\beta_0 + \beta_1x_1 + \beta_2x_2…}}{1 +e^{\beta_0 + \beta_1x_1 + \beta_2x_2…}} \end{equation}$

The upshot of the whole process, then, is that the result of the basic logistic formulation $$\frac{e^x}{1+e^x}$$ is equal to the probability of the “1” outcome that we are trying to predict for each observation in our data.
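The chain from linear sum to odds to probability can be checked numerically. The sketch below uses an arbitrary value for the linear sum; `plogis()` is base R's built-in logistic function and should agree with the hand calculation.

```r
# An arbitrary value for the linear sum (chosen only for illustration)
linear_sum <- 0.5

# Top piece of the logistic function: the odds
odds <- exp(linear_sum)

# Probability from odds
p_from_odds <- odds / (1 + odds)

# Base R's logistic function, for comparison
p_builtin <- plogis(linear_sum)

c(odds = odds, p_from_odds = p_from_odds, p_builtin = p_builtin)

# Going the other way: the log of the odds recovers the linear sum
log(p_from_odds / (1 - p_from_odds))
```

The last line shows why the linear sum is often described as the "log odds" of the outcome.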

We can therefore use the results of our logistic regression model to calculate the probability of the outcome for an individual given their predictor values.

We’ll show you how to handle all of the mechanics in R so you won’t need to implement this manually. Bottom line, don’t panic.

Here is what you really need to know for now:

1. The basic logistic function can transform any value between $$-\infty$$ and $$+\infty$$ into something between 0 and 1, as we can see in the figure produced by the code below:

```r
# Grid of input values from -10 to 10
x <- seq(-10, 10, length.out = 1000)

# Basic logistic function
y <- exp(x) / (1 + exp(x))

plot(x, y, type = "l", ylab = "exp(x)/(1 + exp(x))", main = "Basic Logistic Function")
```

Get a feel for this equation and see for yourself: open up R, grab a calculator, open Excel or whatever, plug a few values into the basic logistic formula, and plot the results.

2. The job of the logistic regression model is to figure out the $$\beta$$ (beta) values that give us the most accurate set of predictions given the input values.

3. If we take the logistic regression model results and plug them into the logistic function, we get the predicted probability of the outcome for a given person.

4. You can handle all of this with some simple R code.
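The points in this list can be sketched end to end with simulated data, where we choose the true beta values ourselves and can check whether the model recovers them. Everything here (the sample size, the `engagement` variable, and the true weights) is invented for illustration.

```r
set.seed(123)

# Simulate one hypothetical predictor
n          <- 5000
engagement <- rnorm(n)

# True beta values we choose ourselves (assumptions for this demo)
true_b0 <- -1.0
true_b1 <- -0.8

# Turn the linear sum into probabilities, then draw 0/1 outcomes
p_true <- plogis(true_b0 + true_b1 * engagement)
left   <- rbinom(n, size = 1, prob = p_true)

# Fit the logistic regression; the estimates should land near the true values
fit <- glm(left ~ engagement, family = "binomial")
coef(fit)

# Plug the estimates into the logistic function by hand for one person...
b       <- coef(fit)
by_hand <- plogis(b[["(Intercept)"]] + b[["engagement"]] * 1.5)

# ...and confirm that predict() gives the same predicted probability
by_predict <- predict(fit, newdata = data.frame(engagement = 1.5), type = "response")
c(by_hand = by_hand, by_predict = unname(by_predict))
```

The estimates will not match the true values exactly because the outcomes are random draws, but with a few thousand rows they should be close.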

We have some starter code below but we’ll discuss this in more detail in our follow-up tutorial.

## Starter Code

If you can’t wait, start experimenting with logistic regression now by first downloading this starter sample data and then running the following model predicting voluntary departures with a single variable, performance level:

Note: Be sure to change the file path in the read_csv call to match where you saved the data.

```r
# Read the sample data (adjust the path to wherever you saved the file)
d <- readr::read_csv("data/Sim_Turnover_Data_HR_Analytics_101_CSV (1).csv")

# Logistic regression: voluntary departure predicted by performance level
m1 <- glm(vol_leave ~ perf, data = d, family = 'binomial')
summary(m1)
##
## Call:
## glm(formula = vol_leave ~ perf, family = "binomial", data = d)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.1107  -0.9453  -0.9453   1.2456   1.6156
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.40374    0.07684  -18.27   <2e-16 ***
## perf         0.41492    0.03326   12.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 14770  on 11110  degrees of freedom
## Residual deviance: 14611  on 11109  degrees of freedom
## AIC: 14615
##
## Number of Fisher Scoring iterations: 4
```

Note that here `glm(…)` stands for “generalized linear model”.

The `family = 'binomial'` argument tells R that we want to run a logistic regression (the binomial family uses the logit link by default).
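As a rough check on the output above, the reported coefficients can be plugged into the logistic function by hand. The `perf` values 1 to 3 below are assumed for illustration; check the actual range of `perf` in the data before reading anything into the numbers.

```r
# Coefficients copied from the summary(m1) output above
b0 <- -1.40374   # (Intercept)
b1 <-  0.41492   # perf

# Assumed performance levels (hypothetical; check the real scale in the data)
perf <- c(1, 2, 3)

# Linear sum, then the logistic function to get predicted probabilities
linear_sum <- b0 + b1 * perf
p_leave    <- exp(linear_sum) / (1 + exp(linear_sum))

round(p_leave, 2)
# roughly 0.27 0.36 0.46
```

If you have the fitted model in memory, `predict(m1, newdata = data.frame(perf = c(1, 2, 3)), type = "response")` should give the same probabilities without the manual arithmetic.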

Again, more details to come in the follow-up tutorial.

## Why You Should Love Logistic Regression

### 1. Easy to Implement

Unlike many of the other machine learning/predictive modeling tools used today, logistic regression is easy to set up. All you really need is a categorical outcome and a few plausible predictor variables to get started. Yes, there is always more to learn, but it’s a great hands-on way to get started with predictive modeling. Despite its simplicity, logistic regression is still very powerful and can often help you get reasonable predictions very quickly.

### 2. Easy to Run

As machine learning techniques get more sophisticated, they often require more computational resources. But with logistic regression you should be able to run your models on a standard laptop and get a result in just a few seconds given the typical size of HR analytics data.

### 3. Easy to Explain

More sophisticated predictive modeling techniques can be a black box when it comes to explaining your results. Sure, the VP of HR would love to maximize model fit, but if you can’t explain how it works or what factors seem to really matter, you probably won’t get anywhere.

With logistic regression, however, you can point to the predictor variables and you can point to the weights to clearly explain what mattered and what didn’t, at least within the model. That will get the underlying “why” conversations going and help others see the value you can bring to people analytics conversations. To be sure, all models are simplifications, but logistic regression models are directly interpretable ones.
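One common way to put a weight into plain language is to exponentiate it, which turns it into an odds ratio. The sketch below does this for the `perf` coefficient reported in the starter model output earlier in this post; it illustrates the arithmetic and is not meant as a substantive finding.

```r
# perf coefficient copied from the starter model output earlier in this post
b_perf <- 0.41492

# Exponentiating a logistic regression weight gives an odds ratio
odds_ratio <- exp(b_perf)
round(odds_ratio, 2)
# roughly 1.51: each one-unit increase in perf multiplies the odds of leaving by about 1.5

# With a fitted model object (such as m1 from the starter code), the same idea is:
# exp(coef(m1))
```

A statement like "each step up in performance level is associated with about 1.5 times the odds of leaving, within this model" is usually easier to discuss with stakeholders than the raw weight.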

## Coming Up Next

In our next post, we’ll give a step-by-step tutorial on logistic regression and walk you through the basics, from building the initial model to understanding the output.