Using Numbers to Code Categories: Introduction to Dummy Coding

For those just starting their HR Analytics journey, it can be hard to handle categorical variables. In today’s post we will explain the difference between qualitative (categorical) and quantitative variables and show you how to tackle categorical information in a regression equation using “dummy codes”.

A member of the HR Analytics community recently wrote and asked the following (note: I edited the question just a bit for this post).

“I would like to build a predictive model with sex (male v. female) included as a predictor. Should I change my data from qualitative into quantitative? For example, should I code Female as a 1 and Male as a 2? Should I use the same pattern for every variable (as in 1, 2, 3…) or can I use a random number like: 2, 4, 6?”

To answer this question, let’s first take a step back and get some clarity around the definitions.

At a basic level, we can divide our variables into two broad categories: quantitative variables and qualitative variables.

Quantitative Variables

A quantitative variable measures the quantity or amount of something.

Examples from the HR world include the number of years at a company, number of employees, or salary.

If you can count or measure it, it’s a quantitative variable.

Qualitative Variables

A qualitative variable represents a category or a difference in kind.

Qualitative variables include male v. female, work-at-home v. on-site, part-time v. full-time, or even Yes v. No v. No Answer.

Note that you can use a qualitative variable to create a quantitative variable by counting them (for example counting the number of males v. females in a department) but this is a form of aggregation; the underlying difference between the category of male v. female is still a difference in kind.
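To make the counting point concrete, here is a minimal sketch using a small, made-up data frame (the `staff` data and its `dept` and `sex` columns are hypothetical, not data from this post). Counting rows per category produces numbers, but the column being counted is still a set of labels.

```r
# Hypothetical toy data: one row per employee
staff <- data.frame(
  dept = c("Sales", "Sales", "Sales", "IT", "IT"),
  sex  = c("f", "m", "f", "m", "m")
)

# Counting rows per category gives a quantitative summary (a head count) ...
counts <- table(staff$dept, staff$sex)
print(counts)

# ... but the underlying column still holds category labels, not amounts
print(unique(staff$sex))
```

The counts are quantities we can do arithmetic on; the labels 'f' and 'm' are not.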

Qualitative Variables and Dummy Coding

With those definitions in our back pocket, we can now start to address the question.

First, we should now see that technically we cannot convert a categorical variable into a quantitative variable because they are fundamentally different things. The difference between a male and female or between a college graduate and non-graduate does not come down to just a number.

But for the purposes of analytics and creating a model, we can represent those categories using a number. In the case of sex, we can use something called a dummy code (namely 0/1) to represent the categories of male and female.

Understanding and Creating Dummy Codes

To understand how dummy coding works, let’s start with some basic data: years at the company and sex (male and female).

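set.seed(42)
# randomly assign each of 200 employees a sex: 'm' or 'f'
s <- sample(c('m', 'f'), size = 200, replace = TRUE)

y <- numeric(200)

# simulate years at the company: males ~ N(4, 1), females ~ N(5, 1)
for (i in seq_along(s)){
  y[i] <- ifelse(s[i] == 'm', rnorm(1, 4, 1), rnorm(1, 5, 1))
}

# stringsAsFactors = TRUE so summary() tabulates sex (no longer the default as of R 4.0.0)
d <- data.frame(years = y, sex = s, stringsAsFactors = TRUE)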

summary(d)
##      years       sex
##  Min.   :1.300   f:111
##  1st Qu.:3.801   m: 89
##  Median :4.532
##  Mean   :4.506
##  3rd Qu.:5.173
##  Max.   :7.460
head(d)
##      years sex
## 1 6.200965   f
## 2 6.044751   f
## 3 2.996791   m
## 4 6.848482   f
## 5 4.333227   f
## 6 5.105514   f
aggregate(years ~ sex, data = d, mean)
##   sex    years
## 1   f 4.942412
## 2   m 3.961877

Our analysis shows that the average number of years for our 200 employees is 4.5.

In addition, we also see that females have an average of 4.94 years at the company, males 3.96 years.

Now let’s create some dummy codes by assigning one group to a ‘0’ for sex and the other to a ‘1’.

In principle, it doesn’t really matter who gets a 1 and who gets a 0 (although as you will see it will impact how you interpret your model). The key for now is that we are consistent. There are packages for doing this in R (e.g. ‘dummies’) but we’ll do it by hand here.

Let’s assign males to 1 and females to 0 using the ifelse function.

d$dummy <- ifelse(d$sex == 'm', 1, 0)

summary(d)
##      years       sex         dummy
##  Min.   :1.300   f:111   Min.   :0.000
##  1st Qu.:3.801   m: 89   1st Qu.:0.000
##  Median :4.532           Median :0.000
##  Mean   :4.506           Mean   :0.445
##  3rd Qu.:5.173           3rd Qu.:1.000
##  Max.   :7.460           Max.   :1.000
aggregate(years ~ dummy, d, mean) # checking the means
##   dummy    years
## 1     0 4.942412
## 2     1 3.961877
table(d$sex, d$dummy) # confirm mapping was done correctly
##
##       0   1
##   f 111   0
##   m   0  89

With our dummy code in place, we’ll now create a simple regression model, using employee sex (dummy coded) to predict years.

m1 <- lm(years ~ dummy, data = d)

summary(m1)
##
## Call:
## lm(formula = years ~ dummy, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.66181 -0.62734  0.00436  0.60054  2.74001
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   4.9424     0.0914  54.073  < 2e-16 ***
## dummy        -0.9805     0.1370  -7.156 1.58e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.963 on 198 degrees of freedom
## Multiple R-squared:  0.2055, Adjusted R-squared:  0.2015
## F-statistic: 51.21 on 1 and 198 DF,  p-value: 1.578e-11

Remember that with a regression equation, we take the intercept value and add that to the value of the regression coefficient times the independent variable value.

Using this model output to predict years for females, we take the intercept value of 4.94 and then add the value from the coefficient times our dummy code value. The females are here coded as a 0 so we end up with a prediction of 4.94 + (-.98 * 0), or just 4.94.

Observe that this is the same as the mean value for females that we saw earlier.

In contrast, for males, we take the intercept value of 4.94 and add (-.98 * 1) because males are dummy coded as a 1. The result is 3.96, again equal to the mean for males.
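If you want to check this arithmetic yourself, here is a small sketch that plugs the rounded coefficients from the model output into the regression equation. The variable names (`intercept`, `slope`, `predicted`) are just illustrative choices.

```r
# Coefficients copied (rounded) from the summary(m1) output above
intercept <- 4.9424
slope     <- -0.9805

# Dummy codes used in the post: females = 0, males = 1
dummy <- c(female = 0, male = 1)

# Prediction = intercept + coefficient * dummy code
predicted <- intercept + slope * dummy
print(round(predicted, 2))
```

The two predictions round to 4.94 and 3.96, the same as the group means from the aggregate() call earlier.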

The big takeaway?

The intercept turns out to be equal to the mean of females because we coded them as 0s for the dummy variable. The coefficient then represents the deviation of males from the female baseline.
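The same relationship can be checked on any two-group data set. Below is a minimal sketch on a tiny, hand-made data frame (the `toy` data are hypothetical, chosen so the group means are easy to compute by eye: 5 for females and 3 for males).

```r
# Hypothetical toy data (not the simulated data from this post)
toy <- data.frame(
  years = c(5, 6, 4, 3, 4, 2),
  sex   = c("f", "f", "f", "m", "m", "m")
)
toy$dummy <- ifelse(toy$sex == "m", 1, 0)  # females = 0 (baseline), males = 1

fit <- lm(years ~ dummy, data = toy)

mean_f <- mean(toy$years[toy$sex == "f"])
mean_m <- mean(toy$years[toy$sex == "m"])

# Intercept vs. baseline mean; coefficient vs. difference in means
print(c(intercept = unname(coef(fit)[1]), female_mean = mean_f))
print(c(coefficient = unname(coef(fit)[2]), difference = mean_m - mean_f))
```

In both printed pairs the two numbers agree: the intercept is the mean of the group coded 0, and the coefficient is the male mean minus the female mean.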

Reversing the Dummy Code Values

To reinforce this point, we’ll reverse the dummy coding with males now as a 0 and females as a 1.

What do you think the intercept will be this time?

d$dummy_2 <- ifelse(d$dummy == 1, 0, 1) # just flipping the dummy coding

m2 <- lm(years ~ dummy_2, data = d)

summary(m2)
##
## Call:
## lm(formula = years ~ dummy_2, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.66181 -0.62734  0.00436  0.60054  2.74001
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.9619     0.1021  38.813  < 2e-16 ***
## dummy_2       0.9805     0.1370   7.156 1.58e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.963 on 198 degrees of freedom
## Multiple R-squared:  0.2055, Adjusted R-squared:  0.2015
## F-statistic: 51.21 on 1 and 198 DF,  p-value: 1.578e-11

If you said 3.96 you’re right. The intercept equals the mean for the baseline case of 0 which is now male.
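As an aside, if the predictor is stored as a factor, R's lm() builds the 0/1 column for you. The sketch below assumes R's default treatment contrasts, where the first factor level is the baseline, and uses a hypothetical `toy` data frame rather than the data from this post. Changing the baseline with relevel() has the same effect as flipping the dummy codes by hand.

```r
# Hypothetical toy data with sex stored as a factor (levels: "f", "m")
toy <- data.frame(
  years = c(5, 6, 4, 3, 4, 2),
  sex   = factor(c("f", "f", "f", "m", "m", "m"))
)

# Default treatment contrasts: the first level ("f") is the baseline (coded 0)
fit_f <- lm(years ~ sex, data = toy)
print(coef(fit_f))

# relevel() makes "m" the baseline, which mirrors flipping the 0/1 codes
toy$sex_m_base <- relevel(toy$sex, ref = "m")
fit_m <- lm(years ~ sex_m_base, data = toy)
print(coef(fit_m))
```

In each fit the intercept is the mean of whichever group is the baseline, and the coefficient simply changes sign when the baseline changes.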

Do I Need to Choose 0 and 1?

Technically, you could choose a different value but this creates a problem of interpretability.

For example, what if we chose to assign 1 to males and 2 to females as our coding?

d$dummy_3 <- ifelse(d$dummy == 1, 1, 2) # recode: males = 1, females = 2

m3 <- lm(years ~ dummy_3, data = d)

summary(m3)
##
## Call:
## lm(formula = years ~ dummy_3, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.66181 -0.62734  0.00436  0.60054  2.74001
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.9813     0.2237  13.329  < 2e-16 ***
## dummy_3       0.9805     0.1370   7.156 1.58e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.963 on 198 degrees of freedom
## Multiple R-squared:  0.2055, Adjusted R-squared:  0.2015
## F-statistic: 51.21 on 1 and 198 DF,  p-value: 1.578e-11

As you can see, the significance test for the coefficient and the overall model fit are the same as in the previous models, but the intercept no longer has a useful interpretation.

The intercept is the predicted value when the code equals 0, and no group in our data is coded 0, so we have lost the ability to immediately interpret the intercept.

In addition, while the coefficient value is the same, we need to multiply it by 1 for one group but by 2 for the other.

Although the value itself still represents the difference between the male and female mean years, the predictions are no longer intuitive to read off. Not a good move.

The bottom line? Use 1s and 0s to keep things simple and directly interpretable.
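To see what happens with other number pairs, including the 2, 4 pattern from the reader's question, here is a rough sketch on a hypothetical `toy` data frame (female mean 5, male mean 3). The column names `code_01`, `code_12`, and `code_24` are made up for this illustration.

```r
# Hypothetical toy data (not the simulated data from this post)
toy <- data.frame(
  years = c(5, 6, 4, 3, 4, 2),
  sex   = c("f", "f", "f", "m", "m", "m")
)
is_f <- toy$sex == "f"

toy$code_01 <- ifelse(is_f, 1, 0)  # males = 0, females = 1
toy$code_12 <- ifelse(is_f, 2, 1)  # males = 1, females = 2
toy$code_24 <- ifelse(is_f, 4, 2)  # males = 2, females = 4

# Fit the same model under each coding and collect the coefficients
coefs <- rbind(
  code_01 = coef(lm(years ~ code_01, data = toy)),
  code_12 = coef(lm(years ~ code_12, data = toy)),
  code_24 = coef(lm(years ~ code_24, data = toy))
)
colnames(coefs) <- c("intercept", "slope")
print(coefs)
```

Only the 0/1 coding gives an intercept equal to a group mean (3, the male mean) together with a slope equal to the difference in means (2). With 1/2 the slope is still 2 but the intercept (1) matches neither group, and with 2/4 the slope drops to 1 because the codes are now two units apart.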

Summary and Recommendations

In this post we talked about the difference between qualitative and quantitative variables. In addition, we showed you how to use dummy coding to represent a two-category qualitative variable.

The summary recommendation is to think about the question you are asking and what dummy coding reference point makes sense; assign 0 to the baseline that means something for your question.

In our next post, we’ll address dummy coding for 3 or more categories.