## Overview

In Part 1, we explained what the standard deviation is and showed you how it can be applied to HR data such as unscheduled days off. Here in Part 2, we apply your new knowledge to some real salary data.

By the end of this tutorial, you will be able to:

• Download and perform some basic data cleaning steps for external salary data
• Apply the standard deviation (SD) to your analyses with confidence
• Understand SDs in context using histograms to visualize your data

Let’s first get some data from the Louisville, KY metro employees database.

``````filename = "emp.txt" #what we will name the downloaded file

datafile <- paste(getwd(), "/", filename, sep = "") #append file location and filename without spaces

## Preprocessing

### Getting the Data

In earlier, less techinical posts (post 1 and post 2), we used some basic preprocessing techniques to focus in and get a quick picture of our data.

``````emp <- emp[emp\$CalendarYear == 2014,] #filtering to just 2014

pol <- emp[emp\$Department == "Louisville Metro Police", ] #filter to just police
summary(pol\$AnnualRate) # quick check for crazy values or missing data.
pol\$Department <- factor(pol\$Department) #resetting to dump empty Department levels``````
``````##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12440   41890   50340   49920   55100  175100``````

### Checking Number of Observations

Before we start throwing statistics all over the place, let’s first make sure we have a sufficient number of observations for the jobs we will look at. This won’t be an issue with a metropolitan police department, but you should always check this first. Always.

``````jobs <- data.frame(table(pol\$JobTitle))# count number of people with a given job
jobs <- jobs[order(jobs\$Freq, decreasing = TRUE),] # sort by frequency
head(jobs, 10) # take a look at the first 10 items in the list``````
``````##                                Var1 Freq
## 895  Police Officer                 1014
## 898  Police Sergeant                 172
## 893  Police Lieutenant                70
## 896  Police Recruit                   53
## 1116 Traffic Guard III                46
## 1114 Traffic Guard I                  39
## 1115 Traffic Guard II                 31
## 603  Information Process Technician   25
## 283  Crime Scene Technician I         15``````

### Selecting the Top 3 Police Roles

``````sel_jobs <- jobs\$Var1[c(1:3)] # selecting three levels of officers
pol <- pol[as.character(pol\$JobTitle) %in% sel_jobs, ] #filtering to just those selected jobs
pol\$JobTitle <- factor(pol\$JobTitle) # redefine the JobTitle factor levels``````

### A Bit More Cleaning

Sometimes text elements come with extra white space that can create problems. Because R treats the “JobTitle” category as a factor here, we can use the “levels” command to puts quotes around the entire label. This will help us see any unexpected whitespace.

``levels(pol\$JobTitle)``
``````## [1] "Police Lieutenant             " "Police Officer                "
## [3] "Police Sergeant               "``````

Let’s eliminate that whitespace by just renaming the factor levels without it.

``levels(pol\$JobTitle) <- c("Police Lieutenant", "Police Officer", "Police Sergeant")``

## Salary Data

Now we can get a summary of annual salary. We’ll use the aggregate function to get separate summaries and standard deviations for each of our three groups.

``aggregate(x = pol\$AnnualRate, by = list(pol\$JobTitle), FUN = summary) #Summary data split by job title``
``````##             Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
## 1 Police Lieutenant  79160     80920    83680  84410     87510  92930
## 2    Police Officer  36920     48690    50340  50340     53210  59450
## 3   Police Sergeant  60570     63900    65020  65040     66190  70950``````
``aggregate(x = pol\$AnnualRate, by = list(pol\$JobTitle), FUN = sd) #standard deviation split by job title``
``````##             Group.1        x
## 1 Police Lieutenant 3757.515
## 2    Police Officer 4707.740
## 3   Police Sergeant 2227.839``````

### Interpreting the Salary Data

A few things are clear just from the numbers. First, Lieutenants make substantially more than Officers or Sergeants. These differences are reflected in the means and medians as well as the maximum salaries within each of these positions.

There are also some differences when it comes to standard deviations though. The SD for the Sergeant salaries, for example, is less than half of that for Officers. This means that Sergeant salaries are generally similar to each other while those of Police Officers are more spread out.

In an HR setting it is helpful to look at the spread of salaries to see whether people with the same job title receive comparable salaries. There are always other elements to consider, but looking at the standard deviations is usually a great place to start.

### Visualizing the Salary Data

Let’s see these distributions to get a feel for what’s really behind the numbers. We can use the histogram function from the lattice library to see a different AnnualRate histogram for each role.

We’ll also specify the break points (the bins) by using the using the sequence command. This let’s us use the same groupings and directly compare the distributions.

``````# install.packages("lattice") # run this if you don't have lattice installed
library(lattice)
pol\$JobTitle = factor(pol\$JobTitle,levels(pol\$JobTitle)[c(1,3,2)]) #reordering the levels for better visual
histogram(x = ~AnnualRate| JobTitle, data = pol, layout = c(1,3), col = "purple",
breaks = seq(from = 36000, to = 100000, by = 2000))``````

In the case of Police Officers, we see a substantial number of officers split off from the rest. That explains the larger SD. In HR, this is a strong signal to dig deeper into the extreme scores to understand the WHY behind the big differences.

The Sergeants give us something resembling a balanced, normal distribution, although you should be sure to note the more limited range of salaries for this group compared to the Officers.

Finally, for the Lieutenants, we see a sort of cutoff on the low end. This makes sense given that these individuals have a meaningfully higher rank and tend to have a lot more experience on the job as a group.

Note also the absence of overlap between the salaries of the three levels. Lieutenant salaries are substantially and visibly higher, again consistent with their experience and rank.

# Summary Points

• Be sure to look at standard deviations as well as means and medians before drawing conclusions
• Plot your data using histograms to actually SEE and understand the SDs.
• When comparing multiple groups, be sure to pay attention to the range of values
• Plot your distributions on the same plot to make direct comparisons easier

## Like this post?

### Get our FREE Turnover Mini Course!

You’ll get 5 insight-rich daily lessons delivered right to your inbox.

In this series you’ll discover:

• How to calculate this critical HR metric
• How turnover can actually be a GOOD thing for your organization
• Other insightful workforce metrics to use today

There’s a bunch more too. All free. All digestible. Right to your inbox.