Practical Introduction to Standard Deviations for Human Capital, Part 2
Applications with real salary data
Overview
In Part 1, we explained what the standard deviation is and showed you how it can be applied to HR data such as unscheduled days off. Here in Part 2, we apply your new knowledge to some real salary data.
By the end of this tutorial, you will be able to:
- Download and perform some basic data cleaning steps for external salary data
- Apply the standard deviation (SD) to your analyses with confidence
- Understand SDs in context using histograms to visualize your data
Preliminaries: Download Load Employee Data
Let’s first get some data from the Louisville, KY metro employees database.
filename = "emp.txt" #what we will name the downloaded file
download.file(url = "http://api.louisvilleky.gov/api/File/DownloadFile?fileName=SalaryData.txt",
destfile = filename) # downloading the file and saving it
datafile <- paste(getwd(), "/", filename, sep = "") #append file location and filename without spaces
emp <- read.delim(datafile) #reading the file we downloaded
Preprocessing
Getting the Data
In earlier, less techinical posts (post 1 and post 2), we used some basic preprocessing techniques to focus in and get a quick picture of our data.
emp <- emp[emp$CalendarYear == 2014,] #filtering to just 2014
pol <- emp[emp$Department == "Louisville Metro Police", ] #filter to just police
summary(pol$AnnualRate) # quick check for crazy values or missing data.
pol$Department <- factor(pol$Department) #resetting to dump empty Department levels
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12440 41890 50340 49920 55100 175100
Checking Number of Observations
Before we start throwing statistics all over the place, let’s first make sure we have a sufficient number of observations for the jobs we will look at. This won’t be an issue with a metropolitan police department, but you should always check this first. Always.
jobs <- data.frame(table(pol$JobTitle))# count number of people with a given job
jobs <- jobs[order(jobs$Freq, decreasing = TRUE),] # sort by frequency
head(jobs, 10) # take a look at the first 10 items in the list
## Var1 Freq
## 895 Police Officer 1014
## 898 Police Sergeant 172
## 893 Police Lieutenant 70
## 896 Police Recruit 53
## 1116 Traffic Guard III 46
## 1114 Traffic Guard I 39
## 1115 Traffic Guard II 31
## 603 Information Process Technician 25
## 16 Administrative Clerk-Police 20
## 283 Crime Scene Technician I 15
Selecting the Top 3 Police Roles
sel_jobs <- jobs$Var1[c(1:3)] # selecting three levels of officers
pol <- pol[as.character(pol$JobTitle) %in% sel_jobs, ] #filtering to just those selected jobs
pol$JobTitle <- factor(pol$JobTitle) # redefine the JobTitle factor levels
A Bit More Cleaning
Sometimes text elements come with extra white space that can create problems. Because R treats the “JobTitle” category as a factor here, we can use the “levels” command to puts quotes around the entire label. This will help us see any unexpected whitespace.
levels(pol$JobTitle)
## [1] "Police Lieutenant " "Police Officer "
## [3] "Police Sergeant "
Let’s eliminate that whitespace by just renaming the factor levels without it.
levels(pol$JobTitle) <- c("Police Lieutenant", "Police Officer", "Police Sergeant")
Salary Data
Now we can get a summary of annual salary. We’ll use the aggregate function to get separate summaries and standard deviations for each of our three groups.
aggregate(x = pol$AnnualRate, by = list(pol$JobTitle), FUN = summary) #Summary data split by job title
## Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
## 1 Police Lieutenant 79160 80920 83680 84410 87510 92930
## 2 Police Officer 36920 48690 50340 50340 53210 59450
## 3 Police Sergeant 60570 63900 65020 65040 66190 70950
aggregate(x = pol$AnnualRate, by = list(pol$JobTitle), FUN = sd) #standard deviation split by job title
## Group.1 x
## 1 Police Lieutenant 3757.515
## 2 Police Officer 4707.740
## 3 Police Sergeant 2227.839
Interpreting the Salary Data
A few things are clear just from the numbers. First, Lieutenants make substantially more than Officers or Sergeants. These differences are reflected in the means and medians as well as the maximum salaries within each of these positions.
There are also some differences when it comes to standard deviations though. The SD for the Sergeant salaries, for example, is less than half of that for Officers. This means that Sergeant salaries are generally similar to each other while those of Police Officers are more spread out.
In an HR setting it is helpful to look at the spread of salaries to see whether people with the same job title receive comparable salaries. There are always other elements to consider, but looking at the standard deviations is usually a great place to start.
Visualizing the Salary Data
Let’s see these distributions to get a feel for what’s really behind the numbers. We can use the histogram function from the lattice library to see a different AnnualRate histogram for each role.
We’ll also specify the break points (the bins) by using the using the sequence command. This let’s us use the same groupings and directly compare the distributions.
# install.packages("lattice") # run this if you don't have lattice installed
library(lattice)
pol$JobTitle = factor(pol$JobTitle,levels(pol$JobTitle)[c(1,3,2)]) #reordering the levels for better visual
histogram(x = ~AnnualRate| JobTitle, data = pol, layout = c(1,3), col = "purple",
breaks = seq(from = 36000, to = 100000, by = 2000))
In the case of Police Officers, we see a substantial number of officers split off from the rest. That explains the larger SD. In HR, this is a strong signal to dig deeper into the extreme scores to understand the WHY behind the big differences.
The Sergeants give us something resembling a balanced, normal distribution, although you should be sure to note the more limited range of salaries for this group compared to the Officers.
Finally, for the Lieutenants, we see a sort of cutoff on the low end. This makes sense given that these individuals have a meaningfully higher rank and tend to have a lot more experience on the job as a group.
Note also the absence of overlap between the salaries of the three levels. Lieutenant salaries are substantially and visibly higher, again consistent with their experience and rank.
Summary Points
- Be sure to look at standard deviations as well as means and medians before drawing conclusions
- Plot your data using histograms to actually SEE and understand the SDs.
- When comparing multiple groups, be sure to pay attention to the range of values
- Plot your distributions on the same plot to make direct comparisons easier
Like this post?
Get our FREE Turnover Mini Course!
You’ll get 5 insight-rich daily lessons delivered right to your inbox.
In this series you’ll discover:
- How to calculate this critical HR metric
- How turnover can actually be a GOOD thing for your organization
- How to develop your own LEADING INDICATORS
- Other insightful workforce metrics to use today
There’s a bunch more too. All free. All digestible. Right to your inbox.
Yes! Sign Me Up!
Comments or Questions?
Add your comments OR just send me an email: john@hranalytics101.com
I would be happy to answer them!
Contact Us
- © 2023 HR Analytics 101
- Privacy Policy