Analyzing Salary Data with R, Part 2: Essential Visualization Techniques
Visualizing and understanding salary information is a crucial function of HR, Human Capital Analytics, and numerous other business segments. In Part 1 of this series you learned some foundational processing techniques to help you clean and visualize your data. Here in Part 2, we take the next step using some foundational yet powerful visualization techniques. By the end of Part 2, you will be able to:
- Use basic plotting techniques to visualize and understand key features of your data
- Use boxplots to visually compare two groups on salary
Preliminaries: Download and Trim Data
If you want to follow along at home, the code below reproduces the steps from Part 1 and serves as the launching point for Part 2. If you are a beginner in analytics and R, you may wish to complete Part 1 first.
sal <- read.csv(url("http://transparentcalifornia.com/export/2013-cities.csv"))
sal <- sal[, -c(1, 6, 8, 9, 10)] # using negative indexing to drop the specified columns
sal$Base.Pay <- as.character(sal$Base.Pay) # convert from factor to character
sal$Base.Pay <- as.numeric(sal$Base.Pay) # convert from character to number
sal$Overtime.Pay <- as.character(sal$Overtime.Pay) # convert from factor to character
sal$Overtime.Pay <- as.numeric(sal$Overtime.Pay) # convert from character to number
sal$Other.Pay <- as.character(sal$Other.Pay) # convert from factor to character
sal$Other.Pay <- as.numeric(sal$Other.Pay) # convert from character to number
sal$Job.Title <- toupper(as.character(sal$Job.Title)) # convert factor to character string and make upper case
sal$Job.Title <- factor(sal$Job.Title) #convert back to a factor. Helpful for later plotting
sal <- sal[sal$Base.Pay > 0, ] # keep only rows with positive values for base pay
sal <- sal[is.na(sal$Base.Pay) == FALSE,] # keep only those rows where base pay value is NOT missing (NA)
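The repeated convert-then-filter steps above can also be written more compactly. The following is a minimal sketch on a small made-up data frame (the `toy` values, including the "Not Provided" entries, are invented for illustration and are not taken from the real file); it assumes the pay columns arrive as factors, as the comments above describe.

```r
# Toy data standing in for the real file (all values are made up)
toy <- data.frame(
  Job.Title    = c("Police Officer", "clerk", "Analyst", "Intern"),
  Base.Pay     = c("50000.50", "Not Provided", "-100", "72000"),
  Overtime.Pay = c("1200", "0", "Not Provided", "350"),
  stringsAsFactors = TRUE  # mimic factor columns
)

pay_cols <- c("Base.Pay", "Overtime.Pay")

# Convert each pay column factor -> character -> numeric in one pass.
# Entries that are not numbers become NA; suppressWarnings() hides
# the coercion warning that as.numeric() would otherwise raise.
toy[pay_cols] <- lapply(toy[pay_cols], function(x) {
  suppressWarnings(as.numeric(as.character(x)))
})

# Keep rows whose base pay is both present and positive, in a single step
toy <- toy[!is.na(toy$Base.Pay) & toy$Base.Pay > 0, ]
print(toy)
```

Combining the two conditions with `&` avoids the intermediate rows of NAs that logical indexing produces when the comparison itself is NA.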
Plotting Sorted Values
A great way to spot basic problems (such as unexpected gaps) is to plot the sorted values. To do this, we apply the plot function to the values sorted from low to high. (Note: If you are using an older computer, you may prefer to plot the sampled data in the commented-out code, because it uses fewer data points and will not overload your system.)
### sampling data for plotting on an older computer
# sam <- sample(sal$Base.Pay, size = 50000, replace = FALSE)
# plot(sort(sam), main = "Sorted Base Pay") # plotting the sorted sample
plot(sort(sal$Base.Pay), main = "Sorted Base Pay") # plotting sorted base pay
In this case, I expected a relatively smooth, continuous curve and that is what I have. If the curve had a big gap or huge spikes, we would probably want to dig deeper to figure out what the causes were. As it stands, everything looks more or less solid. Based on the plot, though, I did get curious about those earning more than $235K.
table(sal$Base.Pay > 235000) # finding how many are making more than $235000
##
##  FALSE   TRUE
## 246390     91
Histograms provide a quick view of the distribution of your data. In this instance, the histogram tells us that we seem to have two distinct groups, perhaps due to the difference between part-time and full-time workers (although the data set here does not provide those fields).
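The sorted plot is a visual check for gaps; the same idea can be checked numerically by looking at the differences between neighbouring sorted values. This is a rough sketch on simulated pay values (the two `runif` groups are an assumption made purely to create an obvious gap), not a step from the original analysis.

```r
set.seed(42)

# Simulated pay with a deliberate hole between 60,000 and 90,000
pay <- c(runif(500, 20000, 60000), runif(500, 90000, 120000))

sorted_pay <- sort(pay)
gaps <- diff(sorted_pay)   # distance between each value and the next one up

biggest <- which.max(gaps) # position of the largest jump in the sorted series
cat("Largest gap:", round(max(gaps)), "\n")
cat("It sits between", round(sorted_pay[biggest]),
    "and", round(sorted_pay[biggest + 1]), "\n")
```

On real data, a handful of large gaps at the top end is normal (a few very high earners); a large gap in the middle of the range is the kind of thing worth investigating.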
hist(sal$Base.Pay, breaks = 100) # vary the number of breaks (or bins) to see the impact
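To see how the number of breaks changes the picture without drawing anything, `hist()` can be called with `plot = FALSE` and the bin counts compared. A small sketch on simulated two-group data (the `rnorm` parameters are invented to imitate a bimodal salary distribution):

```r
set.seed(1)

# Two simulated groups, loosely imitating part-time and full-time pay
pay <- c(rnorm(1000, mean = 15000, sd = 4000),
         rnorm(1000, mean = 75000, sd = 12000))
pay <- pay[pay > 0]  # keep positive values only, as in the cleaning step

# breaks is a suggestion; R picks "pretty" bin edges close to that number
h_coarse <- hist(pay, breaks = 5,   plot = FALSE)
h_fine   <- hist(pay, breaks = 100, plot = FALSE)

cat("Coarse bins:", length(h_coarse$counts), "\n")
cat("Fine bins:  ", length(h_fine$counts), "\n")
```

With only a few bins the two groups can blur together; with many bins the two peaks separate but the bars become noisier, which is why it pays to try several values.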
Next, we try boxplots, a great visualization tool that tidily summarizes piles of data.
bp <- boxplot(sal$Base.Pay, main = "Base Pay Box Plot") # creating a plot object to examine the stats later
If you are unfamiliar with boxplots, we will cover this in a future tutorial. For now, we’ll just focus on the actual box itself. The thick line in the middle of the box is the median (the mid-point in the series of values) and matches the value we got in the summary data above. The top part of the box represents the 3rd quartile (75%) while the bottom represents the 1st quartile (25%).
Based on these values, we can conclude that 50% of those in this dataset have a base pay between approximately $10,000 and $80,000, with a median of roughly $50,000. That’s a huge range, no doubt the result of combining all sorts of workers from all sorts of jobs.
Rather than only drawing the plot, we also assigned the result of boxplot to the variable bp using the assignment operator <-. This gives us the ability to explore the values used to create the boxplot directly. bp$stats, for example, holds the 1st quartile, median, and 3rd quartile values in positions 2, 3, and 4 respectively (positions 1 and 5 are the ends of the whiskers).
str(bp) # inspect the structure of the boxplot object
## List of 6
##  $ stats: num [1:5, 1] 2.30e-01 1.01e+04 5.01e+04 8.00e+04 1.85e+05
##  $ n    : num 246481
##  $ conf : num [1:2, 1] 49895 50339
##  $ out  : num [1:824] 216300 230048 198155 203762 189999 ...
##  $ group: num [1:824] 1 1 1 1 1 1 1 1 1 1 ...
##  $ names: chr ""
bp$stats # the five values used to draw the box and whiskers
##           [,1]
## [1,]      0.23
## [2,]  10133.00
## [3,]  50116.80
## [4,]  79956.00
## [5,] 184688.08
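The five numbers in `bp$stats` can be reproduced by hand, which helps make sense of what the box and whiskers mean. The sketch below uses simulated log-normal pay (an assumption for illustration only) and relies on base R's `fivenum()`, which is what `boxplot()` uses for the box, together with the default 1.5 x IQR whisker rule.

```r
set.seed(7)

# Simulated right-skewed pay values (made up for illustration)
pay <- round(rlnorm(1001, meanlog = 10.5, sdlog = 0.6))

bp_toy <- boxplot(pay, plot = FALSE)  # compute the stats without drawing
stats  <- bp_toy$stats[, 1]

# Positions 2-4 of the stats are the lower hinge, median, and upper hinge
hinges <- fivenum(pay)
print(all.equal(stats[2:4], hinges[2:4]))

# The upper whisker ends at the largest observation that lies within
# 1.5 box-lengths above the upper hinge; anything beyond is in $out
upper_fence <- hinges[4] + 1.5 * (hinges[4] - hinges[2])
print(all.equal(stats[5], max(pay[pay <= upper_fence])))

cat("Points flagged as outliers:", length(bp_toy$out), "\n")
```

Note that hinges from `fivenum()` can differ slightly from `quantile(pay, c(.25, .75))` on small samples, because the two functions interpolate differently.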
Comparing Specific Groups
In many instances, of course, we are much more focused and wish to compare specific groups. Let’s trim our data further to compare POLICE OFFICER and POLICE OFFICER II. We also need to drop the empty factor levels by re-creating the factor, or our plot will blow up with a slot for every unused job title.
select_officers <- c("POLICE OFFICER", "POLICE OFFICER II")
sal <- sal[sal$Job.Title %in% select_officers,] # keeping those in the set
sal$Job.Title <- factor(sal$Job.Title) #refactoring to eliminate empty levels.
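Subsetting a factor keeps all of its original levels, which is why the re-creation step matters. Base R's `droplevels()` does the same job as wrapping the column in `factor()` again; the short sketch below shows both on a made-up factor (the city names are placeholders).

```r
# A made-up factor with three levels
city <- factor(c("Los Angeles", "San Diego", "Fresno", "San Diego"))

# Subsetting removes the Fresno rows but keeps the Fresno level
kept <- city[city %in% c("Los Angeles", "San Diego")]
cat("Levels after subsetting:", nlevels(kept), "\n")

# Either of these removes the now-empty level
refactored <- factor(kept)
dropped    <- droplevels(kept)
cat("Levels after factor():    ", nlevels(refactored), "\n")
cat("Levels after droplevels():", nlevels(dropped), "\n")
```

`droplevels()` also accepts a whole data frame, in which case it cleans every factor column at once.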
Now let’s compare them using boxplots. In this instance we’ll use the formula interface to model Base Pay as a function of Job Title.
boxplot(sal$Base.Pay ~ sal$Job.Title)
This view is revealing. First, there are a few officers that APPEAR to be making over 125K per year. I say “appear” of course because we know data is not always entered correctly (witness our previous negative base pay values). For present purposes, though, we will accept them. In your individual business, unexpectedly high (or low) values call for a bit more investigation and these tools provide a great way to spot them quickly.
Second, as might be expected, the median for POLICE OFFICER is slightly lower than that for POLICE OFFICER II. Note that the middle 50% (the box region) is also wider for POLICE OFFICER. This might be due to a narrower range of service years for POLICE OFFICER II, how much cities pay, the level of experience for those officers in the given cities, or some other factor.
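The two observations read off the boxplot (the medians and the width of the middle 50%) can be confirmed with numbers using `tapply()`. This is a sketch on a tiny invented data set; the pay figures are placeholders, not values from the California file.

```r
# Invented pay values for two job titles
toy <- data.frame(
  Job.Title = factor(rep(c("POLICE OFFICER", "POLICE OFFICER II"), each = 5)),
  Base.Pay  = c(60000, 65000, 70000, 80000, 95000,
                82000, 84000, 85000, 87000, 90000)
)

# Median and interquartile range (width of the box) for each group
medians <- tapply(toy$Base.Pay, toy$Job.Title, median)
iqrs    <- tapply(toy$Base.Pay, toy$Job.Title, IQR)

print(medians)
print(iqrs)
```

In this toy example POLICE OFFICER has the lower median and the wider spread, mirroring the pattern described above.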
Digging Deeper with Boxplots and Histograms
To move closer to an apples-to-apples comparison, let’s narrow our investigation down to just the two largest cities, LA and San Diego.
select_cities <- c("San Diego", "Los Angeles") # selected cities
sal <- sal[sal$Agency %in% select_cities, ]
sal$Agency <- factor(sal$Agency) # again, refactoring to drop empty levels
Now, let’s model base pay by both officer level as well as city (Agency). This should yield 4 box plots (LA/ San Diego X Officer/Officer II) but as you will see below, it doesn’t.
boxplot(sal$Base.Pay ~ sal$Job.Title + sal$Agency, main = "OH NO! WHAT IS WRONG HERE???",
cex.main = 2, col.main = "red")
There should be 4 boxplots, not 2! What gives? Let’s check our data using the table function.
table(sal$Job.Title, sal$Agency) # cross-tabulate job title by city
##
##                     Los Angeles San Diego
##   POLICE OFFICER              0      1964
##   POLICE OFFICER II        4519         0
That’s strange! LA only has Level 2 officers, San Diego only Level 1. That’s definitely not what I expected. We can’t tell whether these titles are correctly applied or if there is something wrong with the dataset itself. Regardless, it strongly reinforces our core lesson: plot your data!
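A cross-tabulation like this is worth running before any two-factor plot, because empty combinations are easy to miss. The sketch below builds a made-up data frame with the same structure (the row counts are invented) and lists which title/city combinations have no rows.

```r
# Made-up data with the same pattern: each title appears in only one city
toy <- data.frame(
  Job.Title = factor(c(rep("POLICE OFFICER", 3), rep("POLICE OFFICER II", 4))),
  Agency    = factor(c(rep("San Diego", 3), rep("Los Angeles", 4)))
)

counts <- table(toy$Job.Title, toy$Agency)
print(counts)

# Row/column positions of every empty cell
empty <- which(counts == 0, arr.ind = TRUE)
for (i in seq_len(nrow(empty))) {
  cat("No rows for:", rownames(counts)[empty[i, 1]],
      "in", colnames(counts)[empty[i, 2]], "\n")
}
```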
I should also note that this suggests that the differences between Officer and Officer II base pay were likely NOT due to seniority as I originally suggested above. Again, check your assumptions by plotting your data.
Given this development, city and officer level correspond perfectly (at least for LA and SD). We can therefore just drop the officer level distinction and compare the cities directly.
boxplot(sal$Base.Pay ~ sal$Agency, main = "Comparison of Police Officer Pay")
This simple black and white boxplot tells us quite a bit, namely that LA officers in the middle 50% earn substantially more than their San Diego counterparts.
Visualization Using Stacked Histograms and Density Plots
Let’s take it a step further with some help from the lattice package, which provides for some handy methods for arranging different data views. Stacked histograms provide an easy visual comparison, as do density plots.
# install.packages("lattice") # run once if the package is not yet installed
library(lattice) # load the package; it provides histogram() and densityplot()
histogram(~ Base.Pay | Agency, data = sal, layout = c(1, 2), # stacked histogram
main = "Stacked Histogram: Los Angeles and San Diego",
xlab = "Base Pay (USD)")
densityplot(~ Base.Pay | Agency, data = sal, layout = c(1, 2), # stacked density plot
main = "Stacked Density Plot: Los Angeles and San Diego",
xlab = "Base Pay (USD)")
The density plot in particular reveals three distinct groups, likely due to something about the pay structure. Again, it pays to plot your data.
Visualization Using Boxplots and Density Plots with ggplot2
Finally, let’s look at some basic plots using ggplot2. This is a great toolset and will be a key feature of many future tutorials. Here, we will use the qplot function which leverages ggplot2 but in a more intuitive, user-friendly way. (Note: If you are new to ggplot2, I suggest starting with the qplot commands first and then building up from there.)
The first plot is the same boxplot we saw above but is a bit easier on the eyes. The second is another density plot similar to the lattice plot above, but featuring visual overlap and with a bit more smoothness. Note the visible “lumpiness”, particularly for the San Diego officer salaries.
# install.packages("ggplot2") # run once if the package is not yet installed
library(ggplot2) # load the package; it provides qplot()
qplot(data = sal, x = Agency, y = Base.Pay, geom = "boxplot", fill = Agency) # basic boxplot
qplot(data = sal, x = Base.Pay, fill = Agency, geom = "density", alpha = I(.8),
xlab = "Salary", ylab = "Density", main = "Los Angeles and San Diego Police Officer Salaries") +
scale_x_continuous(breaks = c(0, 50000, 100000, 150000),
labels = c("$0", "$50,000", "$100,000", "$150,000")) +
theme(axis.text = element_text(size = 12, face = "bold", color = "black"),
axis.ticks = element_blank(), axis.text.y = element_blank()) +
scale_fill_manual(values = c("blue", "dark red"))
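Recent ggplot2 releases (3.4.0 and later) mark `qplot()` as deprecated, so it may help to see the same two plots written with the full `ggplot()` interface. The following is a sketch rather than a drop-in replacement: it assumes ggplot2 is installed and uses simulated pay values (the `rnorm` parameters are invented) in place of the real `sal` data.

```r
library(ggplot2)

set.seed(3)

# Simulated stand-in for the trimmed police officer data (values made up)
toy <- data.frame(
  Agency   = factor(rep(c("Los Angeles", "San Diego"), each = 200)),
  Base.Pay = c(rnorm(200, mean = 90000, sd = 10000),
               rnorm(200, mean = 70000, sd = 12000))
)

# Boxplot: one box per city, filled by city
p_box <- ggplot(toy, aes(x = Agency, y = Base.Pay, fill = Agency)) +
  geom_boxplot()

# Overlapping density plot with semi-transparent fills
p_density <- ggplot(toy, aes(x = Base.Pay, fill = Agency)) +
  geom_density(alpha = 0.8) +
  labs(x = "Salary", y = "Density",
       title = "Los Angeles and San Diego Police Officer Salaries") +
  scale_x_continuous(breaks = c(0, 50000, 100000, 150000),
                     labels = c("$0", "$50,000", "$100,000", "$150,000")) +
  scale_fill_manual(values = c("blue", "darkred"))

print(p_box)
print(p_density)
```

With the real data, replacing `toy` with `sal` should give plots equivalent to the `qplot()` versions; the main difference is that aesthetics go inside `aes()` and each geom is added as its own layer.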
R offers a number of fantastic tools for handling and visualizing data, perfect for the decidedly messy world of HR and Human Capital data. From the salary data alone, we saw the need for some basic cleaning and familiarization as well as basic and more advanced visualization techniques. There is always more to learn but the tools here should help you along the way.
- © 2022 HR Analytics 101