Analyzing Salary Data with R, Part 1: Essential Data Processing Techniques

unnamed-chunk-15-2

Overview

Visualizing and understanding salary information is a crucial function of HR, Human Capital Analytics, and numerous other business segments. In today’s tutorial, we will introduce you to a few essential techniques for reviewing and visualizing salary data.

At the end of Part 1, you will be able to

  • Quickly know what’s in your data
  • Find and eliminate some common salary data quality issues

At the end of Part 2, you will be able to

  • Use basic plotting techniques to visualize and understand key features of your data
  • Use plots to visually compare two groups

Preliminaries: Download and Trim Data

To let you follow along at home, let’s first load some publicly available salary data from the state of California:

http://transparentcalifornia.com/export/2013-cities.csv

In the code below, read.csv is a function that does exactly what its name says: it reads csv files.

The url function opens the connection with the site itself.

In this case, we will read it directly into a dataframe that we will call “sal”.

Be forewarned that the file is somewhat large so it may take a minute or two to load.

Update: Please note that a thoughtful reader has let me know that this website has changed and this specific file is no longer available. I find this extremely irritating. I am leaving the post as is because there is actually no way for me to download that specific information. Regardless, you can still download other salary files at http://transparentcalifornia.com/salaries and the broader lessons of this post and the core of the code still apply. My apologies for this change. In the future, I will simply provide the csv directly instead of relying on the consistency of other sites.

[code language=”r”]

sal <- read.csv(url(“http://transparentcalifornia.com/export/2013-cities.csv”))

[/code]

To keep things focused and easy to view, I will also trim the data to just a few essential columns.

[code language=”r”]

sal <- sal [,-c(1, 6,8,9,10)] #using negative indexing to drop columns specified

[/code]

What’s in your data?

Whenever I get new data, I always explore the basic structure of the data itself before attempting any visualizations or analyses. Let’s start with the dimensions.

[code language=”r”]

dim(sal) # number of rows and columns

[/code]

## [1] 251701      6

This result shows us that we have 251701 rows in the data and 6 columns. In this dataset, the number of columns represents the number of different variables we have available to explore each individual instance of an employee and his/her associated pay data.

Next, we use the str function to understand the structure of our data.

[code language=”r”]

str(sal) #structure of the data

[/code]

## 'data.frame':    251701 obs. of  6 variables:
##  $ Job.Title   : Factor w/ 27550 levels "311 Call Center Manager",..: 3879 19108 19452 3766 1746 2284 21648 21782 6911 1224 ...
##  $ Base.Pay    : Factor w/ 170257 levels "-1013.46","-1048.00",..: 44636 11148 158644 160031 161488 143290 143290 150653 137192 97203 ...
##  $ Overtime.Pay: Factor w/ 77780 levels "-0.01","-1.00",..: 69 69 69 69 69 69 69 69 69 2693 ...
##  $ Other.Pay   : Factor w/ 87438 levels "-0.01","-0.38",..: 70506 24619 48476 41682 80550 42984 36196 81455 43489 7355 ...
##  $ Total.Pay   : num  280039 128834 123856 118741 100919 ...
##  $ Agency      : Factor w/ 441 levels "Adelanto","Agoura Hills",..: 1 1 1 1 1 1 1 1 1 1 ...

Converting Factors to Numbers

One thing you probably noticed is that many of the pay fields were imported as factors even though they actually represent numbers that we wish to analyze. As you will see below, this creates problems.

[code language=”r”]

mean(sal$Base.Pay) # R gets angry here trying to do math on factors

[/code]

## Warning in mean.default(sal$Base.Pay): argument is not numeric or logical:
## returning NA
## [1] NA

There are two steps to fixing this problem:

  1. Convert the item from a factor to a character.
  2. Then convert it from a character to a number.

For consistency, let’s do this for all of the applicable fields.

[code language=”r”]

sal$Base.Pay <- as.character(sal$Base.Pay) # convert from factor to character
sal$Base.Pay <- as.numeric(sal$Base.Pay) # convert from character to number

sal$Overtime.Pay <- as.character(sal$Overtime.Pay) # convert from factor to character
sal$Overtime.Pay <- as.numeric(sal$Overtime.Pay) # convert from character to number

sal$Other.Pay <- as.character(sal$Other.Pay) # convert from factor to character
sal$Other.Pay <- as.numeric(sal$Other.Pay) # convert from character to number

[/code]

Spotting Errors Using Summaries

With the number fields now properly converted from factors to numbers, we can get something more useful from the summary command.

[code language=”r”]

summary(sal)

[/code]

##                 Job.Title         Base.Pay       Overtime.Pay   
##  Police Officer      : 12169   Min.   :-41354   Min.   : -2223  
##  Police Officer II   :  4679   1st Qu.:  9188   1st Qu.:     0  
##  POLICE OFFICER      :  3455   Median : 49184   Median :     0  
##  Recreation Assistant:  2247   Mean   : 49622   Mean   :  4551  
##  Police Sergeant     :  2110   3rd Qu.: 79402   3rd Qu.:  2385  
##  Police Officer III  :  2105   Max.   :353346   Max.   :279105  
##  (Other)             :224936   NA's   :1148     NA's   :1296    
##    Other.Pay        Total.Pay              Agency      
##  Min.   :-75378   Min.   : -7780   Los Angeles: 46918  
##  1st Qu.:     0   1st Qu.:  9881   San Diego  : 11190  
##  Median :   670   Median : 54073   San Jose   :  7403  
##  Mean   :  3894   Mean   : 57982   Long Beach :  5980  
##  3rd Qu.:  4170   3rd Qu.: 91630   Sacramento :  5533  
##  Max.   :430040   Max.   :526580   Oakland    :  5386  
##  NA's   :1292                      (Other)    :169291

Take a look that Job.Title column! Notice something bad? Yep, some jobs with the same title are treated separately because of differences in using upper and lower case.

Fixing Differences in Upper/Lower Case

We can fix this by making everything upper (or lower) case.

There are two steps to the process.

  1. Convert all of the factor items to a character string
  2. Make all of the character strings upper(or lower) case.
  3. Convert everything back to a factor.

Some of these steps could be combined within a single line but I find that can sometimes invite problems that are hard to spot later. If you are learning R, you will learn more quickly by being deliberate and explicit.

[code language=”r”]

sal$Job.Title <- as.character(sal$Job.Title) #Convert all of the factor items to a character string
sal$Job.Title <- toupper(sal$Job.Title) # Make all of the character strings upper(or lower) case.
sal$Job.Title <- factor(sal$Job.Title) #convert back to a factor. Helpful for later plotting

[/code]

Let’s run summary again with these issues cleaned up.

[code language=”r”]

summary(sal)

[/code]

##                 Job.Title         Base.Pay       Overtime.Pay   
##  POLICE OFFICER      : 15629   Min.   :-41354   Min.   : -2223  
##  POLICE OFFICER II   :  4927   1st Qu.:  9188   1st Qu.:     0  
##  POLICE SERGEANT     :  2845   Median : 49184   Median :     0  
##  FIRE CAPTAIN        :  2405   Mean   : 49622   Mean   :  4551  
##  RECREATION ASSISTANT:  2252   3rd Qu.: 79402   3rd Qu.:  2385  
##  RECREATION LEADER   :  2181   Max.   :353346   Max.   :279105  
##  (Other)             :221462   NA's   :1148     NA's   :1296    
##    Other.Pay        Total.Pay              Agency      
##  Min.   :-75378   Min.   : -7780   Los Angeles: 46918  
##  1st Qu.:     0   1st Qu.:  9881   San Diego  : 11190  
##  Median :   670   Median : 54073   San Jose   :  7403  
##  Mean   :  3894   Mean   : 57982   Long Beach :  5980  
##  3rd Qu.:  4170   3rd Qu.: 91630   Sacramento :  5533  
##  Max.   :430040   Max.   :526580   Oakland    :  5386  
##  NA's   :1292                      (Other)    :169291

Hmmm…well there are a few more things here that seem a tad problematic like the missing data (NAs) and the negative base pay values (see Min. under Base.Pay).

But such messiness is real and we need to tackle it. This is where learning R starts to pay off.

Let’s keep only those rows with positive pay values.

[code language=”r”]

sal <- sal[sal$Base.Pay > 0, ] # keep only rows with positive values for base pay

[/code]

sal <- sal[sal$Base.Pay > 0, ] # keep only rows with positive values for base pay

Now let’s drop those with missing values. (Note: More advanced methods for data imputation used to systematically replace missing values will be covered in future posts)

[code language=”r”]

sal &lt;- sal[is.na(sal$Base.Pay) == FALSE,] # keep only those rows where base pay value is NOT missing (NA)

[/code]

sal <- sal[is.na(sal$Base.Pay) == FALSE,] # keep only those rows where base pay value is NOT missing (NA)

With the negative and missing values eliminated, we rerun our summary for the Base.Pay field.

[code language=”r”]

summary(sal$Base.Pay)

[/code]

summary(sal$Base.Pay)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.2  10130.0  50120.0  50440.0  79960.0 353300.0

Much better. The summary data here tell us quite a bit, including that someone apparently made all of 20 cents for base pay in 2013. In a typical business context, such values would signal data processing and collection issues requiring further investigation.

We’ll leave this alone for now but the critical lesson here is that even running a simple summary command can help us quickly spot data quality issues.

Coming Up….

That’s it for Part 1. In Part 2, you will learn some additional basic but effective techniques for spotting more errors in your data as well as some powerful visualization approaches to compare different groups.

Like this post?

Get our FREE Turnover Mini Course!

You’ll get 5 insight-rich daily lessons delivered right to your inbox.

In this series you’ll discover:

  • How to calculate this critical HR metric
  • How turnover can actually be a GOOD thing for your organization
  • How to develop your own LEADING INDICATORS
  • Other insightful workforce metrics to use today

There’s a bunch more too. All free. All digestible. Right to your inbox.

Yes! Sign Me Up!

Comments or Questions?

Add your comments OR just send me an email: john@hranalytics101.com

I would be happy to answer them!

Contact Us

Yes, I would like to receive newsletters from HR Analytics 101.