Analyzing Salary Data with R, Part 1: Essential Data Processing Techniques
Overview
Visualizing and understanding salary information is a crucial task for HR, Human Capital Analytics, and numerous other business functions. In today’s tutorial, we will introduce you to a few essential techniques for reviewing and visualizing salary data.
By the end of Part 1, you will be able to:
- Quickly know what’s in your data
- Find and eliminate some common salary data quality issues
By the end of Part 2, you will be able to:
- Use basic plotting techniques to visualize and understand key features of your data
- Use plots to visually compare two groups
Preliminaries: Download and Trim Data
To let you follow along at home, let’s first load some publicly available salary data from the state of California:
http://transparentcalifornia.com/export/2013-cities.csv
In the code below, read.csv is a function that does exactly what its name says: it reads csv files.
The url function opens a connection to the site itself.
In this case, we will read it directly into a dataframe that we will call “sal”.
Be forewarned that the file is somewhat large so it may take a minute or two to load.
Update: Please note that a thoughtful reader has let me know that this website has changed and this specific file is no longer available. I find this extremely irritating. I am leaving the post as is because there is actually no way for me to download that specific information. Regardless, you can still download other salary files at http://transparentcalifornia.com/salaries and the broader lessons of this post and the core of the code still apply. My apologies for this change. In the future, I will simply provide the csv directly instead of relying on the consistency of other sites.
[code language=”r”]
sal <- read.csv(url("http://transparentcalifornia.com/export/2013-cities.csv"))
[/code]
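Since the file is on the larger side, one option (a sketch of my own, assuming the file, or a similar one from http://transparentcalifornia.com/salaries, is still available) is to save a local copy the first time and read from disk on later runs:
[code language=”r”]
# Sketch: cache a local copy so the large file only needs to be downloaded once.
# The destination file name "2013-cities.csv" is just an example.
if (!file.exists("2013-cities.csv")) {
  download.file("http://transparentcalifornia.com/export/2013-cities.csv",
                destfile = "2013-cities.csv")
}
sal <- read.csv("2013-cities.csv")
[/code]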
To keep things focused and easy to view, I will also trim the data to just a few essential columns.
[code language=”r”]
sal <- sal[, -c(1, 6, 8, 9, 10)] # use negative indexing to drop the specified columns
[/code]
What’s in your data?
Whenever I get new data, I always explore the basic structure of the data itself before attempting any visualizations or analyses. Let’s start with the dimensions.
[code language=”r”]
dim(sal) # number of rows and columns
[/code]
## [1] 251701 6
This result shows us that we have 251,701 rows and 6 columns. In this dataset, each row is an individual employee record and each column is a variable we can use to explore that employee’s pay data.
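Before moving on, it can also help to glance at the actual records. A quick peek (not shown in the original output) uses names and head:
[code language=”r”]
names(sal) # the column names
head(sal) # the first six rows, so you can see what individual records look like
[/code]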
Next, we use the str function to understand the structure of our data.
[code language=”r”]
str(sal) #structure of the data
[/code]
## 'data.frame': 251701 obs. of 6 variables:
## $ Job.Title : Factor w/ 27550 levels "311 Call Center Manager",..: 3879 19108 19452 3766 1746 2284 21648 21782 6911 1224 ...
## $ Base.Pay : Factor w/ 170257 levels "-1013.46","-1048.00",..: 44636 11148 158644 160031 161488 143290 143290 150653 137192 97203 ...
## $ Overtime.Pay: Factor w/ 77780 levels "-0.01","-1.00",..: 69 69 69 69 69 69 69 69 69 2693 ...
## $ Other.Pay : Factor w/ 87438 levels "-0.01","-0.38",..: 70506 24619 48476 41682 80550 42984 36196 81455 43489 7355 ...
## $ Total.Pay : num 280039 128834 123856 118741 100919 ...
## $ Agency : Factor w/ 441 levels "Adelanto","Agoura Hills",..: 1 1 1 1 1 1 1 1 1 1 ...
Converting Factors to Numbers
One thing you probably noticed is that many of the pay fields were imported as factors even though they actually represent numbers that we wish to analyze. As you will see below, this creates problems.
[code language=”r”]
mean(sal$Base.Pay) # R gets angry here trying to do math on factors
[/code]
## Warning in mean.default(sal$Base.Pay): argument is not numeric or logical:
## returning NA
## [1] NA
There are two steps to fixing this problem:
- Convert the item from a factor to a character.
- Then convert it from a character to a number.
For consistency, let’s do this for all of the applicable fields.
[code language=”r”]
sal$Base.Pay <- as.character(sal$Base.Pay) # convert from factor to character
sal$Base.Pay <- as.numeric(sal$Base.Pay) # convert from character to number
sal$Overtime.Pay <- as.character(sal$Overtime.Pay) # convert from factor to character
sal$Overtime.Pay <- as.numeric(sal$Overtime.Pay) # convert from character to number
sal$Other.Pay <- as.character(sal$Other.Pay) # convert from factor to character
sal$Other.Pay <- as.numeric(sal$Other.Pay) # convert from character to number
[/code]
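As an aside, and not part of the original workflow, the same conversion can be written more compactly by looping over the pay columns; you can also pass stringsAsFactors = FALSE to read.csv in the first place so text columns are not imported as factors at all. A minimal sketch of the compact version:
[code language=”r”]
# Sketch: convert several factor columns to numeric in one pass.
# pay_cols is my own shorthand for the three pay fields shown above.
pay_cols <- c("Base.Pay", "Overtime.Pay", "Other.Pay")
sal[pay_cols] <- lapply(sal[pay_cols], function(x) as.numeric(as.character(x)))
[/code]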
Spotting Errors Using Summaries
With the pay fields now properly converted from factors to numbers, we can get something more useful from the summary command.
[code language=”r”]
summary(sal)
[/code]
## Job.Title Base.Pay Overtime.Pay
## Police Officer : 12169 Min. :-41354 Min. : -2223
## Police Officer II : 4679 1st Qu.: 9188 1st Qu.: 0
## POLICE OFFICER : 3455 Median : 49184 Median : 0
## Recreation Assistant: 2247 Mean : 49622 Mean : 4551
## Police Sergeant : 2110 3rd Qu.: 79402 3rd Qu.: 2385
## Police Officer III : 2105 Max. :353346 Max. :279105
## (Other) :224936 NA's :1148 NA's :1296
## Other.Pay Total.Pay Agency
## Min. :-75378 Min. : -7780 Los Angeles: 46918
## 1st Qu.: 0 1st Qu.: 9881 San Diego : 11190
## Median : 670 Median : 54073 San Jose : 7403
## Mean : 3894 Mean : 57982 Long Beach : 5980
## 3rd Qu.: 4170 3rd Qu.: 91630 Sacramento : 5533
## Max. :430040 Max. :526580 Oakland : 5386
## NA's :1292 (Other) :169291
Take a look at that Job.Title column! Notice something off? Yep, some jobs with the same title are treated separately because of differences in upper and lower case.
Fixing Differences in Upper/Lower Case
We can fix this by making everything upper (or lower) case.
There are three steps to the process.
- Convert all of the factor items to character strings.
- Make all of the character strings upper (or lower) case.
- Convert everything back to a factor.
Some of these steps could be combined within a single line but I find that can sometimes invite problems that are hard to spot later. If you are learning R, you will learn more quickly by being deliberate and explicit.
[code language=”r”]
sal$Job.Title <- as.character(sal$Job.Title) # convert the factor items to character strings
sal$Job.Title <- toupper(sal$Job.Title) # make all of the character strings upper case
sal$Job.Title <- factor(sal$Job.Title) # convert back to a factor; helpful for later plotting
[/code]
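If you want to confirm that the case fix actually collapsed duplicate titles, you can check how many distinct job titles remain (a quick check of my own; the exact count will depend on your download):
[code language=”r”]
nlevels(sal$Job.Title) # number of distinct job titles after converting to upper case
[/code]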
Let’s run summary again with these issues cleaned up.
[code language=”r”]
summary(sal)
[/code]
## Job.Title Base.Pay Overtime.Pay
## POLICE OFFICER : 15629 Min. :-41354 Min. : -2223
## POLICE OFFICER II : 4927 1st Qu.: 9188 1st Qu.: 0
## POLICE SERGEANT : 2845 Median : 49184 Median : 0
## FIRE CAPTAIN : 2405 Mean : 49622 Mean : 4551
## RECREATION ASSISTANT: 2252 3rd Qu.: 79402 3rd Qu.: 2385
## RECREATION LEADER : 2181 Max. :353346 Max. :279105
## (Other) :221462 NA's :1148 NA's :1296
## Other.Pay Total.Pay Agency
## Min. :-75378 Min. : -7780 Los Angeles: 46918
## 1st Qu.: 0 1st Qu.: 9881 San Diego : 11190
## Median : 670 Median : 54073 San Jose : 7403
## Mean : 3894 Mean : 57982 Long Beach : 5980
## 3rd Qu.: 4170 3rd Qu.: 91630 Sacramento : 5533
## Max. :430040 Max. :526580 Oakland : 5386
## NA's :1292 (Other) :169291
Hmmm…well, there are a few more things here that seem a tad problematic, like the missing data (NAs) and the negative base pay values (see Min. under Base.Pay).
But such messiness is real and we need to tackle it. This is where learning R starts to pay off.
Let’s keep only those rows with positive base pay values.
[code language=”r”]
sal <- sal[sal$Base.Pay > 0, ] # keep only rows with positive values for base pay
[/code]
Now let’s drop the rows with missing values. (Note: more advanced imputation methods for systematically replacing missing values will be covered in future posts.)
[code language=”r”]
sal <- sal[is.na(sal$Base.Pay) == FALSE,] # keep only those rows where base pay value is NOT missing (NA)
[/code]
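One subtle point: when you subset with a logical test like sal$Base.Pay > 0 and the column contains NAs, R inserts all-NA rows for those cases, which is why the NA filter above still matters. An equivalent one-step alternative (my own variation, not the original code) wraps the test in which(), which returns only the row indices where the condition is TRUE and silently drops the NAs:
[code language=”r”]
# One-step alternative: which() drops NAs, so no separate is.na() filter is needed
sal <- sal[which(sal$Base.Pay > 0), ]
[/code]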
With the negative and missing values eliminated, we rerun our summary for the Base.Pay field.
[code language=”r”]
summary(sal$Base.Pay)
[/code]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2 10130.0 50120.0 50440.0 79960.0 353300.0
Much better. The summary data here tell us quite a bit, including that someone apparently made all of 20 cents for base pay in 2013. In a typical business context, such values would signal data processing and collection issues requiring further investigation.
We’ll leave this alone for now but the critical lesson here is that even running a simple summary command can help us quickly spot data quality issues.
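If you did want to follow up on those suspiciously small values before moving on, a quick sort or filter (a sketch of one possible check, not part of the original post) makes them easy to inspect:
[code language=”r”]
head(sal[order(sal$Base.Pay), ]) # the rows with the lowest base pay, smallest first
sal[sal$Base.Pay < 1, ] # or pull out any record with base pay under one dollar
[/code]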
Coming Up….
That’s it for Part 1. In Part 2, you will learn some additional basic but effective techniques for spotting more errors in your data as well as some powerful visualization approaches to compare different groups.
Comments or Questions?
Add your comments OR just send me an email: john@hranalytics101.com
I would be happy to answer them!