Descriptive HR Analytics, Part 1: Demographics and Distributions

In all of the hubbub around analytics and machine learning, it’s easy to forget that sometimes the most impactful things are also the simplest.

In today’s tutorial we are going back to basics and answer some basic questions about our workforce. Moreover, we are going to do this with a data set that contains only 300 employees.

Why 300? I love huge data sets too but independent business surveys indicate that there are actually more companies with 250-500 employees than companies with a 1000 or more.

Choosing a small dataset will keep us focused on the kinds of questions that more common, typical businesses need to answer and keep us from getting distracted by the bells and whistles of huge datasets and machine learning.

We’ll include some additional descriptions of the processes and functions here for those that are still getting started with R.

About the Data

We are using a publically available dataset from Rich and Patalano from their data posted on Kaggle available here. The file we will use is the ‘core dataset’.

If you are new to analytics but not yet familiar with Kaggle, you should head over there and signup to get the data. It’s totally free, no spam, etc. It’s well established so fear not.

I would also STRONGLY recommend looking through the site and getting familiar with all of the wonderful tutorials there. It’s a great resource. 

Meet Your Workforce Data

Let’s do the following:

  • Load the data
  • Get the names of the fields
  • Run a basic summary
d <- read.csv('./human-resources-data-set/core_dataset.csv', stringsAsFactors = F, header = T)

dim(d)
## [1] 302  21
names(d)
##  [1] "Employee.Name"       "Employee.Number"     "State"              
##  [4] "Zip"                 "DOB"                 "Age"                
##  [7] "Sex"                 "MaritalDesc"         "CitizenDesc"        
## [10] "Hispanic.Latino"     "RaceDesc"            "Date.of.Hire"       
## [13] "Date.of.Termination" "Reason.For.Term"     "Employment.Status"  
## [16] "Department"          "Position"            "Pay.Rate"           
## [19] "Manager.Name"        "Employee.Source"     "Performance.Score"
summary(d)
##  Employee.Name      Employee.Number        State                Zip       
##  Length:302         Min.   :6.020e+08   Length:302         Min.   : 1013  
##  Class :character   1st Qu.:1.102e+09   Class :character   1st Qu.: 1901  
##  Mode  :character   Median :1.204e+09   Mode  :character   Median : 2132  
##                     Mean   :1.205e+09                      Mean   : 6705  
##                     3rd Qu.:1.401e+09                      3rd Qu.: 2421  
##                     Max.   :1.988e+09                      Max.   :98052  
##                     NA's   :1                              NA's   :1      
##      DOB                 Age            Sex            MaritalDesc       
##  Length:302         Min.   :25.00   Length:302         Length:302        
##  Class :character   1st Qu.:31.00   Class :character   Class :character  
##  Mode  :character   Median :37.00   Mode  :character   Mode  :character  
##                     Mean   :38.55                                        
##                     3rd Qu.:44.00                                        
##                     Max.   :67.00                                        
##                     NA's   :1                                            
##  CitizenDesc        Hispanic.Latino      RaceDesc        
##  Length:302         Length:302         Length:302        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Date.of.Hire       Date.of.Termination Reason.For.Term   
##  Length:302         Length:302          Length:302        
##  Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character    Mode  :character  
##                                                           
##                                                           
##                                                           
##                                                           
##  Employment.Status   Department          Position            Pay.Rate    
##  Length:302         Length:302         Length:302         Min.   :14.00  
##  Class :character   Class :character   Class :character   1st Qu.:20.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :24.00  
##                                                           Mean   :30.72  
##                                                           3rd Qu.:43.00  
##                                                           Max.   :80.00  
##                                                           NA's   :1      
##  Manager.Name       Employee.Source    Performance.Score 
##  Length:302         Length:302         Length:302        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 

Histogram Me

Let’s start with a histogram for age. I’ll set the border to false (removing lines around the rectangles) and choose a darker red because I like how it looks.

hist(d$Age, breaks= 20, border = F, col = 'red3')

Note how much more informative this histogram is compared to a mean or a median.

Immediately we see we have huge group of somewhat younger workers, a steady decline as age increases, and then a final bump at the end. We’ll come back to that for sure.

Next, let’s look at males and females with a simple table.

table(d$Sex)
## 
##        Female   male   Male 
##      1    174      1    126

The blank field and the lower case ‘m’ show us we need to do some cleanup here. First, we’ll look at that instance where we have no information.

To do this, we’ll use the ‘which’ function to find which row has the missing value.

ind <- which(d$Sex =='')
ind
## [1] 302

It’s row 302, the very end of the data.

We’ll use the tail function to see the last few rows

tail(d)
##        Employee.Name Employee.Number State  Zip        DOB Age    Sex
## 297  Patronick, Luke      1112030979    MA 1844  2/20/1979  38   Male
## 298     Saada, Adell      1012023185    MA 2132  7/24/1986  31 Female
## 299    Szabo, Andrew      1201031324    MA 2140   5/6/1983  34   Male
## 300     True, Edward      1102024057    MA 2451  6/14/1983  34   Male
## 301 Sweetwater, Alex      1001644719    MA 2184 11/22/1966  51   Male
## 302                               NA         NA             NA       
##     MaritalDesc CitizenDesc Hispanic.Latino                  RaceDesc
## 297      Single  US Citizen              No                     Asian
## 298     Married  US Citizen              No                     White
## 299      Single  US Citizen              No                     White
## 300      Single Non-Citizen              No Black or African American
## 301      Single  US Citizen              No                     White
## 302                                                                  
##     Date.of.Hire Date.of.Termination      Reason.For.Term
## 297    11/7/2011            9/7/2015     Another position
## 298    11/5/2012                     N/A - still employed
## 299     7/7/2014                     N/A - still employed
## 300    2/18/2013           4/15/2014       medical issues
## 301    8/15/2011                     N/A - still employed
## 302                                                      
##          Employment.Status                Department
## 297 Voluntarily Terminated      Software Engineering
## 298                 Active      Software Engineering
## 299                 Active      Software Engineering
## 300 Voluntarily Terminated      Software Engineering
## 301                 Active Software Engineering     
## 302                                                 
##                         Position Pay.Rate    Manager.Name
## 297            Software Engineer    52.25 Alex Sweetwater
## 298            Software Engineer    49.25 Alex Sweetwater
## 299            Software Engineer    48.00 Alex Sweetwater
## 300            Software Engineer    45.42 Alex Sweetwater
## 301 Software Engineering Manager    27.00 Jennifer Zamora
## 302                                    NA                
##                       Employee.Source Performance.Score
## 297                Diversity Job Fair           Exceeds
## 298            Pay Per Click - Google       Fully Meets
## 299                          MBTA ads       Exceptional
## 300                Diversity Job Fair       Fully Meets
## 301 Search Engine - Google Bing Yahoo       Fully Meets
## 302

The whole row is blank so let’s drop that row by copying the d over to d2 and excluding the last row.

Note that we could just recopy it over ‘d’ but I generally don’t like to overwrite active datasets (i.e. datasets that I’m working with) because it can become difficult to keep track of things.

d2 <- d[1:301, ]

We’ll also replace that instance of ‘male’ with ‘Male’ using the ifelse function. Everything will keep the same value.

d2$Sex <- ifelse(d2$Sex == 'male', 'Male', d2$Sex)

### Check the result to be sure
table(d2$Sex)
## 
## Female   Male 
##    174    127

Love Me Some Boxplots

Now let’s look to see if we have any meaningful differences in age for males v. females.

For this, we’ll use a boxplot with our formula approach to split on sex. Remember that in a boxplot, the box represents the middle 50% of the population and the dark line represents the median.

boxplot(Age ~ Sex, data = d2, col = 'red3')

This also beats the straight comparison of means. Overall we can we have comparable medians, but a wider spread of ages for females. The little dots at the very top also tell us that females made up most of that little bump of employees in our earlier age histogram.

Let’s take a peak at race.

table(d2$RaceDesc)
## 
## American Indian or Alaska Native                            Asian 
##                                4                               31 
##        Black or African American                         Hispanic 
##                               54                                4 
##                Two or more races                            White 
##                               18                              190

Not the most diverse bunch, but of course this ignores all sorts of contributing factors including region. Let’s take a look at states.

table(d2$State)
## 
##  AL  AZ  CA  CO  CT  FL  GA  ID  IN  KY  MA  ME  MT  NC  ND  NH  NV  NY 
##   1   1   1   1   6   1   1   1   1   1 266   1   1   1   1   1   1   1 
##  OH  OR  PA  RI  TN  TX  UT  VA  VT  WA 
##   1   1   1   1   1   3   1   1   2   1

We have a ton of states here but most of our people are in MA. For our own information we’ll just break down diversity between MA and everywhere else. We’ll create a new state variable with just two categories: MA and Other

d2$State_2 <- ifelse(d2$State == 'MA', 'MA', 'Other')

table(d2$RaceDesc, d2$State_2)
##                                   
##                                     MA Other
##   American Indian or Alaska Native   3     1
##   Asian                             30     1
##   Black or African American         44    10
##   Hispanic                           4     0
##   Two or more races                 11     7
##   White                            174    16

Not super helpful. We’ll move on.

Pay and Performance

With some of the basic desciptors out of the way, let’s look at performance and pay.

table(d2$Performance.Score)
## 
##             90-day meets                  Exceeds              Exceptional 
##                       31                       28                        9 
##              Fully Meets N/A- too early to review        Needs Improvement 
##                      172                       37                       15 
##                      PIP 
##                        9

Lots of categories here. We’ll next filter down to just those who are still employed since that is the group of interest for us.

We’ll also get the percentages by wrapping our table function in the prop.table function to get proportions and then rounding it.

d3 <- d2[d2$Reason.For.Term == 'N/A - still employed',]

table(d3$Performance.Score)
## 
##             90-day meets                  Exceeds              Exceptional 
##                       18                       18                        9 
##              Fully Meets N/A- too early to review        Needs Improvement 
##                      116                       15                        7 
##                      PIP 
##                        5
round(prop.table(table(d3$Performance.Score)),2)
## 
##             90-day meets                  Exceeds              Exceptional 
##                     0.10                     0.10                     0.05 
##              Fully Meets N/A- too early to review        Needs Improvement 
##                     0.62                     0.08                     0.04 
##                      PIP 
##                     0.03

Finally, let’s look at pay. We’ll use quantiles, a histogram, and some boxplots.

quantile(d3$Pay.Rate)
##    0%   25%   50%   75%  100% 
## 14.00 20.00 24.75 53.00 80.00
hist(d3$Pay.Rate, border = F, breaks = 20, col = 'red3')

The quantile function tells us what the pay looks like for each 25% of our workforce.

The histogram is the most informative as usual, with a large cluster of people on the right and, again, a big peak at the end. Note that this seems to mirror our age distribution.

Next, let’s break down the pay by the performance measures. We’ll use the aggregate function to get the median pay for each performance category.

aggregate(Pay.Rate ~ Performance.Score, data = d3, median)
##          Performance.Score Pay.Rate
## 1             90-day meets   33.175
## 2                  Exceeds   22.250
## 3              Exceptional   35.500
## 4              Fully Meets   24.000
## 5 N/A- too early to review   22.000
## 6        Needs Improvement   53.000
## 7                      PIP   24.000

Well this is interesting.

Our ‘needs improvement’ group has the highest pay level…unexpected.

Remember though there are only 7 people in this category so one or two people in this category with extremely high pay might be skewing our results. Of course, our ‘fully meets’ group is also at the lower end, also a bit surprising. Of course, this group might be mostly made up of younger employees.

Let’s break down pay by Sex using some boxplots.

boxplot(Pay.Rate ~ Sex, data = d3, col = 'red3')

This shows us that males have more variance (bigger box around the median) if only a slightly higher median.

We can formally test for a difference in mean pay by using a two-sample t.test. This tests whether two separate samples (here males v. females) differ on some outcome of interest (pay). We’ll explicitly set paired = F to show you that we are using two separate groups (v. comparing the same people in two separate conditions).

t.test(Pay.Rate ~ Sex, data = d3, paired = F)
## 
##  Welch Two Sample t-test
## 
## data:  Pay.Rate by Sex
## t = -1.2963, df = 165.33, p-value = 0.1967
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.871198  1.632065
## sample estimates:
## mean in group Female   mean in group Male 
##             31.25019             34.36975

The p-value here is greater than .05 which indicates that there is no statistically significant difference in the pay rate for males v. females overall. As I have called out in other posts, the of statistical significance is A LOT more complicated than I am indicating here so you should be sure revisit this issue as you progress through your HR Analytics training.

Understanding Pay

Pay is always interesting so let’s work on understanding it a bit more in fictional company. Here we’ll do some basic plots and a few simple regression models to understand it a bit more.

Remembering that age is a reasonable proxy for experience, we’ll plot pay rate against age using a scatterplot

plot(d3$Age, d3$Pay.Rate, col = 'red3', pch = 19)

Hmmm. I expected to a see a clear relationship between age and pay, with pay increasing overall as the age increases but that is not what we see.

Instead, we see basically two different bands. I’m not going to bother including a linear regression line in our plot because there is nothing even remotely linear about the relationship.

Now I’m intrigued. What does pay depend on?

We’ll need to look at some other variables.

Department and position are two good candidates so let’s explore those a bit.

table(d3$Department)
## 
##             Admin Offices          Executive Office 
##                         8                         1 
##                     IT/IS         Production        
##                        29                       117 
##                     Sales      Software Engineering 
##                        26                         6 
## Software Engineering      
##                         1

One of the software engineering positions has a funky label with lots of white space. We’ll use the trim white space function to remove that.

d3$Department <- trimws(d3$Department)
table(d3$Department)
## 
##        Admin Offices     Executive Office                IT/IS 
##                    8                    1                   29 
##           Production                Sales Software Engineering 
##                  117                   26                    7

Next we’ll drop the Executive Office because that position is qualitatively different from the rest and only one person is in that position.

d4 <- d3[d3$Department != 'Executive Office',]
table(d4$Department)
## 
##        Admin Offices                IT/IS           Production 
##                    8                   29                  117 
##                Sales Software Engineering 
##                   26                    7

Now we’ll do some boxplots around pay and deparment. We’ll shrink the axis labels a bit with cex.axis command to get them in.

boxplot(Pay.Rate ~ Department, data = d4, col = 'red3', cex.axis = .8)

That’s a much clearer picture. Let’s remember of course that their are only a few software engineers and admins so be careful about interpreting those boxes. Still, the differences overall are substantial.

A Simple Linear Regression Model

Finally, we’ll end our lesson today with a linear regression model. For those that are not familiar with linear regression, at a high level it provides a method for measuring whether a independent variables can account for differences in a single outcome (dependent) variable.

I’ll write up a more detailed introduction to linear regression in a future post, but this code should be enough to get you started for now and be able to explore other resources to increase your learning.

Here, we want to account for differences in pay.

We’ll include department as well as sex, age, and race to see if any of those help us account for pay differences in our workforce.

We can do this with the lm (linear model) function along with a formula approach to create the model.

We’ll also assign it to a new object in case we want to explore the model results in greater depth.

m1 <- lm(Pay.Rate ~ Department + Sex + Age + RaceDesc, data = d4)

summary(m1)
## 
## Call:
## lm(formula = Pay.Rate ~ Department + Sex + Age + RaceDesc, data = d4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.685  -4.909  -1.424   1.873  35.924 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       28.25237    6.59932   4.281 3.06e-05 ***
## DepartmentIT/IS                   13.07515    3.95103   3.309 0.001136 ** 
## DepartmentProduction              -7.12956    3.62447  -1.967 0.050757 .  
## DepartmentSales                   25.25324    4.04075   6.250 3.04e-09 ***
## DepartmentSoftware Engineering    18.38355    5.07769   3.620 0.000385 ***
## SexMale                            0.98339    1.47801   0.665 0.506703    
## Age                                0.01873    0.08820   0.212 0.832110    
## RaceDescAsian                     -0.52776    5.42480  -0.097 0.922611    
## RaceDescBlack or African American  2.31674    5.17839   0.447 0.655149    
## RaceDescHispanic                  17.81807    7.54999   2.360 0.019376 *  
## RaceDescTwo or more races         -3.14540    5.80384  -0.542 0.588541    
## RaceDescWhite                      1.11074    5.00653   0.222 0.824683    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.779 on 175 degrees of freedom
## Multiple R-squared:  0.6427, Adjusted R-squared:  0.6202 
## F-statistic: 28.62 on 11 and 175 DF,  p-value: < 2.2e-16

We have a ton of output here, but as a first pass, we see that most of our variables have p values that vastly exceed .05. This means they are statisticall non-significant.

Overall, we see that that Age and Sex don’t seem to impact the results. In addition, we note that race doesn’t seem to matter either.

Of course, the hispanic group factor is significant in the output, but there are only 3 individuals in that group. For our final model, we’ll drop race because the membership in each group is so sparse. In other cases, we might wish to combine those in smaller groups into a single larger group to make more reasonable comparisons (e.g. white v. non-white)

We’ll keep Age and Sex in our final model too because those are two factors that need to be accounted for in the HR world regardless. Finally, we’ll keep department because it clearly matters.

Our final model will therefore be the following:

m2 <- lm(Pay.Rate ~ Department + Sex + Age, data = d4)

summary(m2)
## 
## Call:
## lm(formula = Pay.Rate ~ Department + Sex + Age, data = d4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.020  -5.569  -1.246   1.326  37.266 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    29.41913    4.56903   6.439 1.06e-09 ***
## DepartmentIT/IS                13.32485    4.00602   3.326 0.001067 ** 
## DepartmentProduction           -7.08869    3.67818  -1.927 0.055526 .  
## DepartmentSales                24.64643    4.06322   6.066 7.56e-09 ***
## DepartmentSoftware Engineering 18.31329    5.16342   3.547 0.000497 ***
## SexMale                         1.68280    1.47925   1.138 0.256798    
## Age                             0.01185    0.08932   0.133 0.894563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.952 on 180 degrees of freedom
## Multiple R-squared:  0.6194, Adjusted R-squared:  0.6067 
## F-statistic: 48.82 on 6 and 180 DF,  p-value: < 2.2e-16

Overall then, our final model shows a substantial impact of department on pay, accounting for 60% of the total variance. This is of course an artificial dataset but the steps we have detailed nonetheless provide the foundation for your own exploration. 

Conclusion

In this tutorial we stepped you through a number of basic steps for exploring your HR data, including basic visualizations, simple steps for breaking down important data elements, and a basic linear regression model helping understand differences in pay. This analysis also helped rule out some common questions around pay disparities by race and sex in this sample dataset.

What matters of course is the process. Applying these kinds of analyses to your own HR data should be similarly informative and help you better understand your workforce even if you work at mid-sized company. Analytics is not just for the huge corporations of the world.



Contact Us

Yes, I would like to receive newsletters from HR Analytics 101.