Descriptive HR Analytics, Part 1: Demographics and Distributions
In all of the hubbub around analytics and machine learning, it’s easy to forget that sometimes the most impactful things are also the simplest.
In today’s tutorial we are going back to basics and answer some basic questions about our workforce. Moreover, we are going to do this with a data set that contains only 300 employees.
Why 300? I love huge data sets too but independent business surveys indicate that there are actually more companies with 250-500 employees than companies with a 1000 or more.
Choosing a small dataset will keep us focused on the kinds of questions that more common, typical businesses need to answer and keep us from getting distracted by the bells and whistles of huge datasets and machine learning.
We’ll include some additional descriptions of the processes and functions here for those that are still getting started with R.
About the Data
We are using a publically available dataset from Rich and Patalano from their data posted on Kaggle available here. The file we will use is the ‘core dataset’.
If you are new to analytics but not yet familiar with Kaggle, you should head over there and signup to get the data. It’s totally free, no spam, etc. It’s well established so fear not.
I would also STRONGLY recommend looking through the site and getting familiar with all of the wonderful tutorials there. It’s a great resource.
Meet Your Workforce Data
Let’s do the following:
- Load the data
- Get the names of the fields
- Run a basic summary
d <- read.csv('./human-resources-data-set/core_dataset.csv', stringsAsFactors = F, header = T)
dim(d)
## [1] 302 21
names(d)
## [1] "Employee.Name" "Employee.Number" "State"
## [4] "Zip" "DOB" "Age"
## [7] "Sex" "MaritalDesc" "CitizenDesc"
## [10] "Hispanic.Latino" "RaceDesc" "Date.of.Hire"
## [13] "Date.of.Termination" "Reason.For.Term" "Employment.Status"
## [16] "Department" "Position" "Pay.Rate"
## [19] "Manager.Name" "Employee.Source" "Performance.Score"
summary(d)
## Employee.Name Employee.Number State Zip
## Length:302 Min. :6.020e+08 Length:302 Min. : 1013
## Class :character 1st Qu.:1.102e+09 Class :character 1st Qu.: 1901
## Mode :character Median :1.204e+09 Mode :character Median : 2132
## Mean :1.205e+09 Mean : 6705
## 3rd Qu.:1.401e+09 3rd Qu.: 2421
## Max. :1.988e+09 Max. :98052
## NA's :1 NA's :1
## DOB Age Sex MaritalDesc
## Length:302 Min. :25.00 Length:302 Length:302
## Class :character 1st Qu.:31.00 Class :character Class :character
## Mode :character Median :37.00 Mode :character Mode :character
## Mean :38.55
## 3rd Qu.:44.00
## Max. :67.00
## NA's :1
## CitizenDesc Hispanic.Latino RaceDesc
## Length:302 Length:302 Length:302
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Date.of.Hire Date.of.Termination Reason.For.Term
## Length:302 Length:302 Length:302
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Employment.Status Department Position Pay.Rate
## Length:302 Length:302 Length:302 Min. :14.00
## Class :character Class :character Class :character 1st Qu.:20.00
## Mode :character Mode :character Mode :character Median :24.00
## Mean :30.72
## 3rd Qu.:43.00
## Max. :80.00
## NA's :1
## Manager.Name Employee.Source Performance.Score
## Length:302 Length:302 Length:302
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
Histogram Me
Let’s start with a histogram for age. I’ll set the border to false (removing lines around the rectangles) and choose a darker red because I like how it looks.
hist(d$Age, breaks= 20, border = F, col = 'red3')
Note how much more informative this histogram is compared to a mean or a median.
Immediately we see we have huge group of somewhat younger workers, a steady decline as age increases, and then a final bump at the end. We’ll come back to that for sure.
Next, let’s look at males and females with a simple table.
table(d$Sex)
##
## Female male Male
## 1 174 1 126
The blank field and the lower case ‘m’ show us we need to do some cleanup here. First, we’ll look at that instance where we have no information.
To do this, we’ll use the ‘which’ function to find which row has the missing value.
ind <- which(d$Sex =='')
ind
## [1] 302
It’s row 302, the very end of the data.
We’ll use the tail function to see the last few rows
tail(d)
## Employee.Name Employee.Number State Zip DOB Age Sex
## 297 Patronick, Luke 1112030979 MA 1844 2/20/1979 38 Male
## 298 Saada, Adell 1012023185 MA 2132 7/24/1986 31 Female
## 299 Szabo, Andrew 1201031324 MA 2140 5/6/1983 34 Male
## 300 True, Edward 1102024057 MA 2451 6/14/1983 34 Male
## 301 Sweetwater, Alex 1001644719 MA 2184 11/22/1966 51 Male
## 302 NA NA NA
## MaritalDesc CitizenDesc Hispanic.Latino RaceDesc
## 297 Single US Citizen No Asian
## 298 Married US Citizen No White
## 299 Single US Citizen No White
## 300 Single Non-Citizen No Black or African American
## 301 Single US Citizen No White
## 302
## Date.of.Hire Date.of.Termination Reason.For.Term
## 297 11/7/2011 9/7/2015 Another position
## 298 11/5/2012 N/A - still employed
## 299 7/7/2014 N/A - still employed
## 300 2/18/2013 4/15/2014 medical issues
## 301 8/15/2011 N/A - still employed
## 302
## Employment.Status Department
## 297 Voluntarily Terminated Software Engineering
## 298 Active Software Engineering
## 299 Active Software Engineering
## 300 Voluntarily Terminated Software Engineering
## 301 Active Software Engineering
## 302
## Position Pay.Rate Manager.Name
## 297 Software Engineer 52.25 Alex Sweetwater
## 298 Software Engineer 49.25 Alex Sweetwater
## 299 Software Engineer 48.00 Alex Sweetwater
## 300 Software Engineer 45.42 Alex Sweetwater
## 301 Software Engineering Manager 27.00 Jennifer Zamora
## 302 NA
## Employee.Source Performance.Score
## 297 Diversity Job Fair Exceeds
## 298 Pay Per Click - Google Fully Meets
## 299 MBTA ads Exceptional
## 300 Diversity Job Fair Fully Meets
## 301 Search Engine - Google Bing Yahoo Fully Meets
## 302
The whole row is blank so let’s drop that row by copying the d over to d2 and excluding the last row.
Note that we could just recopy it over ‘d’ but I generally don’t like to overwrite active datasets (i.e. datasets that I’m working with) because it can become difficult to keep track of things.
d2 <- d[1:301, ]
We’ll also replace that instance of ‘male’ with ‘Male’ using the ifelse function. Everything will keep the same value.
d2$Sex <- ifelse(d2$Sex == 'male', 'Male', d2$Sex)
### Check the result to be sure
table(d2$Sex)
##
## Female Male
## 174 127
Love Me Some Boxplots
Now let’s look to see if we have any meaningful differences in age for males v. females.
For this, we’ll use a boxplot with our formula approach to split on sex. Remember that in a boxplot, the box represents the middle 50% of the population and the dark line represents the median.
boxplot(Age ~ Sex, data = d2, col = 'red3')
This also beats the straight comparison of means. Overall we can we have comparable medians, but a wider spread of ages for females. The little dots at the very top also tell us that females made up most of that little bump of employees in our earlier age histogram.
Let’s take a peak at race.
table(d2$RaceDesc)
##
## American Indian or Alaska Native Asian
## 4 31
## Black or African American Hispanic
## 54 4
## Two or more races White
## 18 190
Not the most diverse bunch, but of course this ignores all sorts of contributing factors including region. Let’s take a look at states.
table(d2$State)
##
## AL AZ CA CO CT FL GA ID IN KY MA ME MT NC ND NH NV NY
## 1 1 1 1 6 1 1 1 1 1 266 1 1 1 1 1 1 1
## OH OR PA RI TN TX UT VA VT WA
## 1 1 1 1 1 3 1 1 2 1
We have a ton of states here but most of our people are in MA. For our own information we’ll just break down diversity between MA and everywhere else. We’ll create a new state variable with just two categories: MA and Other
d2$State_2 <- ifelse(d2$State == 'MA', 'MA', 'Other')
table(d2$RaceDesc, d2$State_2)
##
## MA Other
## American Indian or Alaska Native 3 1
## Asian 30 1
## Black or African American 44 10
## Hispanic 4 0
## Two or more races 11 7
## White 174 16
Not super helpful. We’ll move on.
Pay and Performance
With some of the basic desciptors out of the way, let’s look at performance and pay.
table(d2$Performance.Score)
##
## 90-day meets Exceeds Exceptional
## 31 28 9
## Fully Meets N/A- too early to review Needs Improvement
## 172 37 15
## PIP
## 9
Lots of categories here. We’ll next filter down to just those who are still employed since that is the group of interest for us.
We’ll also get the percentages by wrapping our table function in the prop.table function to get proportions and then rounding it.
d3 <- d2[d2$Reason.For.Term == 'N/A - still employed',]
table(d3$Performance.Score)
##
## 90-day meets Exceeds Exceptional
## 18 18 9
## Fully Meets N/A- too early to review Needs Improvement
## 116 15 7
## PIP
## 5
round(prop.table(table(d3$Performance.Score)),2)
##
## 90-day meets Exceeds Exceptional
## 0.10 0.10 0.05
## Fully Meets N/A- too early to review Needs Improvement
## 0.62 0.08 0.04
## PIP
## 0.03
Finally, let’s look at pay. We’ll use quantiles, a histogram, and some boxplots.
quantile(d3$Pay.Rate)
## 0% 25% 50% 75% 100%
## 14.00 20.00 24.75 53.00 80.00
hist(d3$Pay.Rate, border = F, breaks = 20, col = 'red3')
The quantile function tells us what the pay looks like for each 25% of our workforce.
The histogram is the most informative as usual, with a large cluster of people on the right and, again, a big peak at the end. Note that this seems to mirror our age distribution.
Next, let’s break down the pay by the performance measures. We’ll use the aggregate function to get the median pay for each performance category.
aggregate(Pay.Rate ~ Performance.Score, data = d3, median)
## Performance.Score Pay.Rate
## 1 90-day meets 33.175
## 2 Exceeds 22.250
## 3 Exceptional 35.500
## 4 Fully Meets 24.000
## 5 N/A- too early to review 22.000
## 6 Needs Improvement 53.000
## 7 PIP 24.000
Well this is interesting.
Our ‘needs improvement’ group has the highest pay level…unexpected.
Remember though there are only 7 people in this category so one or two people in this category with extremely high pay might be skewing our results. Of course, our ‘fully meets’ group is also at the lower end, also a bit surprising. Of course, this group might be mostly made up of younger employees.
Let’s break down pay by Sex using some boxplots.
boxplot(Pay.Rate ~ Sex, data = d3, col = 'red3')
This shows us that males have more variance (bigger box around the median) if only a slightly higher median.
We can formally test for a difference in mean pay by using a two-sample t.test. This tests whether two separate samples (here males v. females) differ on some outcome of interest (pay). We’ll explicitly set paired = F to show you that we are using two separate groups (v. comparing the same people in two separate conditions).
t.test(Pay.Rate ~ Sex, data = d3, paired = F)
##
## Welch Two Sample t-test
##
## data: Pay.Rate by Sex
## t = -1.2963, df = 165.33, p-value = 0.1967
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.871198 1.632065
## sample estimates:
## mean in group Female mean in group Male
## 31.25019 34.36975
The p-value here is greater than .05 which indicates that there is no statistically significant difference in the pay rate for males v. females overall. As I have called out in other posts, the of statistical significance is A LOT more complicated than I am indicating here so you should be sure revisit this issue as you progress through your HR Analytics training.
Understanding Pay
Pay is always interesting so let’s work on understanding it a bit more in fictional company. Here we’ll do some basic plots and a few simple regression models to understand it a bit more.
Remembering that age is a reasonable proxy for experience, we’ll plot pay rate against age using a scatterplot
plot(d3$Age, d3$Pay.Rate, col = 'red3', pch = 19)
Hmmm. I expected to a see a clear relationship between age and pay, with pay increasing overall as the age increases but that is not what we see.
Instead, we see basically two different bands. I’m not going to bother including a linear regression line in our plot because there is nothing even remotely linear about the relationship.
Now I’m intrigued. What does pay depend on?
We’ll need to look at some other variables.
Department and position are two good candidates so let’s explore those a bit.
table(d3$Department)
##
## Admin Offices Executive Office
## 8 1
## IT/IS Production
## 29 117
## Sales Software Engineering
## 26 6
## Software Engineering
## 1
One of the software engineering positions has a funky label with lots of white space. We’ll use the trim white space function to remove that.
d3$Department <- trimws(d3$Department)
table(d3$Department)
##
## Admin Offices Executive Office IT/IS
## 8 1 29
## Production Sales Software Engineering
## 117 26 7
Next we’ll drop the Executive Office because that position is qualitatively different from the rest and only one person is in that position.
d4 <- d3[d3$Department != 'Executive Office',]
table(d4$Department)
##
## Admin Offices IT/IS Production
## 8 29 117
## Sales Software Engineering
## 26 7
Now we’ll do some boxplots around pay and deparment. We’ll shrink the axis labels a bit with cex.axis command to get them in.
boxplot(Pay.Rate ~ Department, data = d4, col = 'red3', cex.axis = .8)
That’s a much clearer picture. Let’s remember of course that their are only a few software engineers and admins so be careful about interpreting those boxes. Still, the differences overall are substantial.
A Simple Linear Regression Model
Finally, we’ll end our lesson today with a linear regression model. For those that are not familiar with linear regression, at a high level it provides a method for measuring whether a independent variables can account for differences in a single outcome (dependent) variable.
I’ll write up a more detailed introduction to linear regression in a future post, but this code should be enough to get you started for now and be able to explore other resources to increase your learning.
Here, we want to account for differences in pay.
We’ll include department as well as sex, age, and race to see if any of those help us account for pay differences in our workforce.
We can do this with the lm (linear model) function along with a formula approach to create the model.
We’ll also assign it to a new object in case we want to explore the model results in greater depth.
m1 <- lm(Pay.Rate ~ Department + Sex + Age + RaceDesc, data = d4)
summary(m1)
##
## Call:
## lm(formula = Pay.Rate ~ Department + Sex + Age + RaceDesc, data = d4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.685 -4.909 -1.424 1.873 35.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.25237 6.59932 4.281 3.06e-05 ***
## DepartmentIT/IS 13.07515 3.95103 3.309 0.001136 **
## DepartmentProduction -7.12956 3.62447 -1.967 0.050757 .
## DepartmentSales 25.25324 4.04075 6.250 3.04e-09 ***
## DepartmentSoftware Engineering 18.38355 5.07769 3.620 0.000385 ***
## SexMale 0.98339 1.47801 0.665 0.506703
## Age 0.01873 0.08820 0.212 0.832110
## RaceDescAsian -0.52776 5.42480 -0.097 0.922611
## RaceDescBlack or African American 2.31674 5.17839 0.447 0.655149
## RaceDescHispanic 17.81807 7.54999 2.360 0.019376 *
## RaceDescTwo or more races -3.14540 5.80384 -0.542 0.588541
## RaceDescWhite 1.11074 5.00653 0.222 0.824683
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.779 on 175 degrees of freedom
## Multiple R-squared: 0.6427, Adjusted R-squared: 0.6202
## F-statistic: 28.62 on 11 and 175 DF, p-value: < 2.2e-16
We have a ton of output here, but as a first pass, we see that most of our variables have p values that vastly exceed .05. This means they are statisticall non-significant.
Overall, we see that that Age and Sex don’t seem to impact the results. In addition, we note that race doesn’t seem to matter either.
Of course, the hispanic group factor is significant in the output, but there are only 3 individuals in that group. For our final model, we’ll drop race because the membership in each group is so sparse. In other cases, we might wish to combine those in smaller groups into a single larger group to make more reasonable comparisons (e.g. white v. non-white)
We’ll keep Age and Sex in our final model too because those are two factors that need to be accounted for in the HR world regardless. Finally, we’ll keep department because it clearly matters.
Our final model will therefore be the following:
m2 <- lm(Pay.Rate ~ Department + Sex + Age, data = d4)
summary(m2)
##
## Call:
## lm(formula = Pay.Rate ~ Department + Sex + Age, data = d4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.020 -5.569 -1.246 1.326 37.266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.41913 4.56903 6.439 1.06e-09 ***
## DepartmentIT/IS 13.32485 4.00602 3.326 0.001067 **
## DepartmentProduction -7.08869 3.67818 -1.927 0.055526 .
## DepartmentSales 24.64643 4.06322 6.066 7.56e-09 ***
## DepartmentSoftware Engineering 18.31329 5.16342 3.547 0.000497 ***
## SexMale 1.68280 1.47925 1.138 0.256798
## Age 0.01185 0.08932 0.133 0.894563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.952 on 180 degrees of freedom
## Multiple R-squared: 0.6194, Adjusted R-squared: 0.6067
## F-statistic: 48.82 on 6 and 180 DF, p-value: < 2.2e-16
Overall then, our final model shows a substantial impact of department on pay, accounting for 60% of the total variance. This is of course an artificial dataset but the steps we have detailed nonetheless provide the foundation for your own exploration.
Conclusion
In this tutorial we stepped you through a number of basic steps for exploring your HR data, including basic visualizations, simple steps for breaking down important data elements, and a basic linear regression model helping understand differences in pay. This analysis also helped rule out some common questions around pay disparities by race and sex in this sample dataset.
What matters of course is the process. Applying these kinds of analyses to your own HR data should be similarly informative and help you better understand your workforce even if you work at mid-sized company. Analytics is not just for the huge corporations of the world.
Contact Us
- © 2023 HR Analytics 101
- Privacy Policy