Descriptive HR Analytics, Part 2: Highlighting Data

As you have no doubt learned along your HR analytics journey, there is just no getting around the dirty work of data science, namely cleaning and reshaping. But that’s no excuse for making sloppy or unappealing visualizations when you need to share your work. There is a time for quick and dirty, but there is also a time for neat and tidy.

In today’s post I’ll share a few simple color tricks for effectively highlighting key data points.

Make the Data

We’ll start by generating some data relating age and salary by running the code below. You can just go ahead and copy and run this code to create the dataset.

For those new to HR Analytics, you can also use this as some starter code to help you learn about basic data handling and some of the more common operations that you’ll need to master.

I’ve heavily annotated the data creation steps below so you understand what each piece is about but feel free to drop me a line if something is not clear.

## Adding the libraries we need
library(ggplot2)
library(RColorBrewer) 


set.seed(42)

# Normal distribution for age
age <- rnorm(200, 40, 6)

## Some random variation that we'll add
## to make things more realistic
noise <- rnorm(200, mean = 8000, sd = 3000 )

## Making salary a function of age
## plus noise and a 20K starting point
salary <- (age*500) + noise + 20000


### Using the salary quartile cutoffs as cut points
### Then assigning each row its quartile number (1 to 4)

quart  <- as.numeric(cut(salary, breaks = quantile(salary, probs = seq(0, 1, 0.25)), 
      include.lowest = TRUE, labels = 1:4))

### Creating a simple linear regression model
### regressing salary on age for later use in figures
m1 <- lm(salary~age)

### Getting the residuals
### These measure how far off the model 
### prediction was
resid <- m1$residuals

### Getting the top and bottom 5% cutoffs for the residuals
### Used to identify the points where the model was the most off
temp_quantile <- quantile(resid, c(.05, .95))

### creating a residual outlier field
out <- ifelse(resid <= temp_quantile[1], 1, (ifelse(resid >= temp_quantile[2],1,0)))

# Bringing them all together into a single dataframe

hr <- data.frame(age, salary, quart, out)
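A quick sanity check can confirm the flags came out as intended. The sketch below rebuilds the same variables so it runs on its own, counts the rows in each quartile and the flagged residuals, and compares the nested ifelse with a single vectorized condition. The names cutoffs and out_alt are mine, not part of the code above.

```r
# Rebuild the variables from the post so this runs on its own
set.seed(42)
age    <- rnorm(200, 40, 6)
noise  <- rnorm(200, mean = 8000, sd = 3000)
salary <- (age * 500) + noise + 20000

quart <- as.numeric(cut(salary,
                        breaks = quantile(salary, probs = seq(0, 1, 0.25)),
                        include.lowest = TRUE, labels = 1:4))
resid <- residuals(lm(salary ~ age))

# Nested ifelse, as in the post
cutoffs <- quantile(resid, c(.05, .95))
out <- ifelse(resid <= cutoffs[1], 1, ifelse(resid >= cutoffs[2], 1, 0))

# Hypothetical alternative: one logical test converted to 0/1
out_alt <- as.numeric(resid <= cutoffs[1] | resid >= cutoffs[2])

table(quart)          # rows per salary quartile
sum(out)              # number of flagged residual outliers
all(out == out_alt)   # do the two versions agree?
```

If the quartile counts are not roughly equal, or the two outlier vectors disagree, something went wrong in the setup.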

Highlighting with Basic Plots

We’ll start with the simplest of scatterplots using the basic plot function, setting the shape with pch = 19 and the color to my favorite, red.

plot(hr$age, hr$salary, pch = 19, col = 'red3', xlab = 'Age', ylab = 'Salary')

Not a bad start but a little meh.

Let’s get a bit more sophisticated by coloring the points according to which quartile the salary falls into. We can do this by setting the color to hr$quart. We have numbers there, so R will look each one up in its current color palette: black for quartile 1, a red for quartile 2, etc.

plot(hr$age, hr$salary, pch = 19, col = hr$quart, xlab = 'Age', ylab = 'Salary')
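As an aside, you can see which colors the numbers map to by printing the active palette (the exact shades depend on your R version). If you would rather not rely on the defaults, one option is to index your own vector of color names. This is a minimal sketch; quart_demo is a toy stand-in for hr$quart and the four colors are arbitrary picks of mine.

```r
# Numeric colors are looked up in the active palette
palette()[1:4]

# Toy quartile codes (hypothetical stand-in for hr$quart)
quart_demo <- c(1, 2, 3, 4, 2, 1)

# Hypothetical explicit mapping: pick a named color per quartile
quart_cols <- c('grey40', 'orange', 'red3', 'dodgerblue')[quart_demo]
quart_cols
```

Passing a vector like quart_cols to col works the same way as passing the numbers, but you control exactly which color each quartile gets.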

Sort of better but a tad busy. Let’s instead highlight, say, the ones in the 3rd quartile. To do this, we’ll use the ifelse function to create a separate color vector with ‘red3’ for those in quartile 3 and grey for the rest. Then we’ll assign col by referring to that vector of colors.

temp_color <- ifelse(hr$quart == 3, 'red3', 'grey')
plot(hr$age, hr$salary, pch = 19, col = temp_color, xlab = 'Age', ylab = 'Salary')

The contrast between the grey and the red points (which we are trying to highlight) really just makes things pop.
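One thing to watch for: with a single plot call, points are drawn in row order, so a grey point can land on top of a red one. A possible workaround, sketched below, is to draw the grey points first and then add the highlighted ones with points. The data is rebuilt so the block runs on its own, and is_hi is a name I made up.

```r
# Rebuild age, salary, and quartile so this runs on its own
set.seed(42)
age    <- rnorm(200, 40, 6)
salary <- (age * 500) + rnorm(200, mean = 8000, sd = 3000) + 20000
quart  <- as.numeric(cut(salary,
                         breaks = quantile(salary, probs = seq(0, 1, 0.25)),
                         include.lowest = TRUE, labels = 1:4))

# Hypothetical flag for the rows we want to highlight
is_hi <- quart == 3

# Grey background points first, keeping the full axis ranges
plot(age[!is_hi], salary[!is_hi], pch = 19, col = 'grey',
     xlim = range(age), ylim = range(salary),
     xlab = 'Age', ylab = 'Salary')

# Highlighted points drawn last so they sit on top
points(age[is_hi], salary[is_hi], pch = 19, col = 'red3')
```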

Finally, we’ll add the basic regression line of salary regressed on age using the abline function.

temp_color <- ifelse(hr$quart == 3, 'red3', 'grey')
plot(hr$age, hr$salary, pch = 19, col = temp_color, xlab = 'Age', ylab = 'Salary')
abline(m1, col = 'dodgerblue')
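When you hand abline a fitted model, it pulls the intercept and slope from the model's coefficients. The sketch below makes that explicit by passing them in by hand. Since we built salary as 500 times age plus noise averaging 8,000 plus 20,000, the slope should land somewhere near 500 and the intercept somewhere near 28,000. The name coefs is mine.

```r
# Rebuild the data and model so this runs on its own
set.seed(42)
age    <- rnorm(200, 40, 6)
salary <- (age * 500) + rnorm(200, mean = 8000, sd = 3000) + 20000
m1     <- lm(salary ~ age)

# Named vector holding the intercept and the age slope
coefs <- coef(m1)
coefs

plot(age, salary, pch = 19, col = 'grey', xlab = 'Age', ylab = 'Salary')

# Same line as abline(m1), with intercept (a) and slope (b) spelled out
abline(a = coefs[['(Intercept)']], b = coefs[['age']], col = 'dodgerblue')
```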

Highlighting in ggplot2

The contrast between the grey and the red is super strong and really draws the eye to the red points. I arbitrarily picked the 3rd quartile above, but next we’ll work through a few more variations, this time using ggplot2.

We’ll start by declaring a vector of colors using the ‘Set1’ color palette from RColorBrewer (the library we loaded above). Note that I am choosing just the first 4 of these colors because that is the most I would need for my quartiles.

### plot with GGPLOT2
color_set <- brewer.pal(4, 'Set1')

ggplot(hr, aes(age, salary)) + geom_point(color = color_set[hr$quart])

That might be a helpful example of how to assign multiple different colors, but it’s a tad busy.

Let’s instead try our trick of assigning greys first.

### plot grey 
ggplot(hr, aes(age, salary)) + geom_point(color = 'grey')

Let’s see what happens if we assign red to the top quartile and grey everywhere else. We’ll do this here by using ifelse logic for the color parameter in our geom_point layer.

### plot grey with red for top quartile

ggplot(hr, aes(age, salary)) + geom_point(color = ifelse(hr$quart < 4, 'grey', 'red3'))

Sharp!
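If you also want a legend, one alternative is to map a logical column to color inside aes and set the two colors with scale_color_manual. This is only a sketch of that approach, not a replacement for the one-liner above. The column top_q is hypothetical, and the data is rebuilt so the block runs on its own.

```r
library(ggplot2)

# Rebuild the data frame so this runs on its own
set.seed(42)
age    <- rnorm(200, 40, 6)
salary <- (age * 500) + rnorm(200, mean = 8000, sd = 3000) + 20000
hr     <- data.frame(age, salary)
hr$quart <- as.numeric(cut(hr$salary,
                           breaks = quantile(hr$salary, probs = seq(0, 1, 0.25)),
                           include.lowest = TRUE, labels = 1:4))

# Hypothetical logical column: TRUE for the top salary quartile
hr$top_q <- hr$quart == 4

# Map the flag to color, then choose the color for each value
p <- ggplot(hr, aes(age, salary, color = top_q)) +
  geom_point() +
  scale_color_manual(values = c(`TRUE` = 'red3', `FALSE` = 'grey'),
                     name = 'Top quartile')
p
```

The trade-off is a little more code in exchange for a legend and colors that live in one place.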

Now let’s use the same trick but instead highlight the data points with extreme residuals (that is, the ones furthest from the regression line).

### plot grey with red for residuals
ggplot(hr, aes(age, salary)) + geom_point(color = ifelse(hr$out == 1,'red3', 'grey'))
ggplot(hr, aes(age, salary)) + geom_point(color = ifelse(hr$out == 1,'red3', 'grey')) +
    geom_smooth(method = 'lm', color = '#3D7699')

Summary

This one is short and sweet: use color to highlight the specific data points of interest and use grey for the rest.

Additional Resources

  • Use the col2rgb function to get the RGB values of your standard colors (e.g. try col2rgb('red3'))
  • Plug those values (or other favorites) into this very handy Adobe color wheel tool and see what combinations you like: https://color.adobe.com/create/color-wheel/
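
Here is a small sketch of that first tip using only base R. It looks up the RGB values for 'red3' and converts them to a hex code you can paste into a color tool. The names rgb_vals and hex are mine.

```r
# RGB components of a named R color (3-row matrix: red, green, blue)
rgb_vals <- col2rgb('red3')
rgb_vals

# Convert the 0-255 components to a hex string
hex <- rgb(rgb_vals['red', 1], rgb_vals['green', 1], rgb_vals['blue', 1],
           maxColorValue = 255)
hex   # "#CD0000"
```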
