HR Data: Visualize with Histograms First
People usually just want someone to get to the point. In HR data, this usually means one thing: an average.
As we have noted in previous explorations (here and here), though, that average can sometimes misrepresent the truth. To avoid such issues, it is critical to plot your data first, before pumping out averages, bar graphs, and the like.
In this tutorial, I am going to step you through the interpretation of two datasets with similar averages and standard deviations using a histogram. The histogram is the first plot you should ALWAYS make when trying to answer any question with data.
It might seem simple (or even simplistic), but I assure you that taking this basic step will substantially enhance your understanding of your data and help you avoid common “data-based” mistakes and bad decisions.
By the end of this brief tutorial, you will specifically be able to:
- Use a histogram to see what values pop up in your data and how often
- Know when you should and should not use the average to meaningfully summarize your data
To keep things focused on HR, we are going to use simulated sales data from our team of sales reps. The data is simulated, but it broadly reflects the kinds of patterns you are likely to see in larger organizations.
As always, you can learn the key lessons just by reading, but I strongly recommend following along at home using the code snippets below to maximize your understanding and retention.
Make Our Data
Let’s start by just creating simulated sales data to play with.
# install.packages("poweRlaw") # Run this line without the leading # if you don't have this package
library("poweRlaw")
set.seed(2)
p1 <- 1000 * (rpldis(1000, xmin = 20, alpha = 10) + runif(1000, 0, 1)) # skewed distribution
n1 <- rnorm(1000, mean(p1), sd(p1)) # normal distribution
Now that we have our data in place, let’s run our standard summary measures, the mean and the standard deviation.
mean(p1)
## [1] 22529.64
mean(n1)
## [1] 22589.42
The means are practically the same.
sd(p1)
## [1] 2844.134
sd(n1)
## [1] 2827.166
The standard deviations are practically the same too.
If we just went by the mean and standard deviation, as many do, then we would conclude that these two datasets are similar. After all, if the average and the spread (standard deviation) are essentially the same, what else is there?
Plot Our Histograms
When we plot the histograms, we see there is plenty.
hist(p1, xlim = c(15000, 40000), ylim = c(0,400), xlab = "Sales ($)", ylab = "Number of Sales Agents", breaks = 30, col = "red", main = "Histogram for Skewed Sales Data")
hist(n1, xlim = c(15000, 40000), ylim = c(0,400), xlab = "Sales ($)", ylab = "Number of Sales Agents", breaks = 30, col = "red", main = "Histogram of Normally Distributed Sales Data")
Remember that a histogram takes all of the values for a given measure (in this case sales data) and puts each of them into bins. In this case, we have a bin for every $1,000 in sales. Thus, sales between $20,000 and $21,000 go in one bin, sales between $21,000 and $22,000 go into the next bin, etc. It then counts how many observations are in each of the bins to determine the height of each bar.
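If you want to see the binning step for yourself, here is a minimal sketch in base R. Note that `sales` is a stand-in I am assuming purely for illustration (a shifted exponential rather than the poweRlaw draw above, so it runs without extra packages), and `edges`, `bins`, and `bin_counts` are names made up for this example.

```r
# Stand-in for skewed sales data (assumption: shifted exponential,
# NOT the poweRlaw draw used in this tutorial).
set.seed(2)
sales <- 20000 + rexp(1000, rate = 1 / 2500)

# Bin edges every $1,000, wide enough to cover every observation.
edges <- seq(floor(min(sales) / 1000) * 1000,
             ceiling(max(sales) / 1000) * 1000,
             by = 1000)

# cut() assigns each sale to a bin; table() counts the sales in each bin.
bins <- cut(sales, breaks = edges, include.lowest = TRUE, dig.lab = 6)
bin_counts <- table(bins)
head(bin_counts)

# hist() does the same kind of counting when handed the same edges.
# plot = FALSE returns the counts without drawing anything.
h <- hist(sales, breaks = edges, plot = FALSE)
h$counts
```

One design note: when you pass a single number to `breaks` in `hist()`, R treats it as a suggestion for the number of bins, so passing explicit edges like this is one way to be sure of the bin width.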
The first histogram reveals that the vast majority of sales reps sold roughly $21,000 worth of product with NONE selling below $20,000. This is definitely NOT what one might expect given the average and the standard deviation. In addition, we also see that a few sold roughly $35,000 worth of items. Again, totally unexpected.
In stats jargon, we say a distribution is “skewed” when the values are clustered around one end of the distribution or another. This first set of sales data is strongly skewed.
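If you would like a number to go with the picture, one common option is the moment-based skewness coefficient. The sketch below is only an illustration: `skewed_sales` and `normal_sales` are assumed stand-ins built in base R (not the `p1` and `n1` from above), and `skewness()` is a small helper written for this example rather than a built-in function.

```r
# Stand-in data (assumption: shifted exponential for the skewed group,
# and a normal sample with a matching mean and standard deviation).
set.seed(2)
skewed_sales <- 20000 + rexp(1000, rate = 1 / 2500)
normal_sales <- rnorm(1000, mean(skewed_sales), sd(skewed_sales))

# Moment-based sample skewness: the average cubed deviation from the mean,
# scaled by the cubed (population-style) standard deviation.
skewness <- function(x) {
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^1.5
}

skewness(skewed_sales)  # clearly positive: a long tail to the right
skewness(normal_sales)  # close to zero: roughly symmetric
```

A value near zero suggests a roughly symmetric distribution, while a large positive value points to a long tail of high values. It is a useful check, but it does not replace looking at the histogram.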
The histogram for the second dataset presents a totally different picture despite having essentially the same average and standard deviation. Instead of being skewed, we see that the sales are quite balanced around the mean, with approximately half below the average of $22,589 and half above it. This is classic “normal curve” stuff.
If we stuck with just averages (and perhaps the standard deviation) the way most people do, we would have said that the sales performance of these two groups was the same. The histogram instead shows us that we have two totally different patterns of behavior.
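A quick way to check whether the average is a fair summary is to put it next to the median and the share of people who fall below the mean. This is a rough sketch under the same assumptions as before: the data are base R stand-ins rather than `p1` and `n1`, and `profile()` and `summary_table` are hypothetical names for this example.

```r
# Stand-in data (assumption: shifted exponential for the skewed group,
# and a normal sample with a matching mean and standard deviation).
set.seed(2)
skewed_sales <- 20000 + rexp(1000, rate = 1 / 2500)
normal_sales <- rnorm(1000, mean(skewed_sales), sd(skewed_sales))

# Summarize one group: mean, sd, median, and the share of reps below the mean.
profile <- function(x) {
  c(mean = mean(x),
    sd = sd(x),
    median = median(x),
    share_below_mean = mean(x < mean(x)))
}

# One row per group, so the two profiles can be compared side by side.
summary_table <- rbind(skewed = profile(skewed_sales),
                       normal = profile(normal_sales))
round(summary_table, 2)
```

When the median sits well below the mean and clearly more than half of the group falls below the average, the average is describing the tail more than the typical rep.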
In the case of the skewed distribution, we can see clearly that low sales are the rule rather than the exception, with the VAST majority of reps clustering around $21,000. Some resulting questions we may wish to ask when faced with a skewed distribution include the following:
- Is this an artefact of how we measured something or a problem with the data? (Note: You should ALWAYS ask yourself this when you get something unexpectedly good, bad, or weird)
- What percentage of our TOTAL sales are accounted for by just our high performers?
- If we want to increase sales, where should we concentrate our efforts? (Hint: The tall bar!)
- Is there something hugely different between our top sellers and everyone else? Territory? Management? Competition in the region?
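The question about high performers' share of total sales takes only a few lines to answer. Here is a minimal sketch, assuming a stand-in `sales` vector and defining "high performer" as the top 10% of reps; that cutoff is my assumption for illustration, not a standard, and `cutoff` and `top_share` are made-up names.

```r
# Stand-in for skewed sales data (assumption: shifted exponential,
# NOT the poweRlaw draw used in this tutorial).
set.seed(2)
sales <- 20000 + rexp(1000, rate = 1 / 2500)

# Assumption: a "high performer" is anyone at or above the 90th percentile.
cutoff <- quantile(sales, 0.90)

# Share of total sales dollars that comes from those reps.
top_share <- sum(sales[sales >= cutoff]) / sum(sales)
round(100 * top_share, 1)  # percent of total sales from the top 10% of reps
```

If the top 10% of reps account for much more than 10% of total sales, that is a sign the tail is doing a lot of the work and deserves a closer look.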
Looking at the second, normally distributed group, we may ask some slightly different questions such as:
- What can we do to bring up the lower end performers?
- Should we retain the agents on the lower end? Should we retrain them?
- Is the distribution of sales correlated with a similarly distributed factor like experience?
Where To Look for Skewed Distributions in HR Data
Yes, we used simulated data for ease of illustration, but skewed distributions pop up all over in HR and Human Capital data. For example:
- Salary data (both within an organization and nationally)
- Frequency of contributions by employees in an enterprise social network
- Employee absences
- Annual turnover broken down by different business areas or departments
We have seen how two different groups with similar averages and standard deviations for sales data can have starkly different performance profiles. These different profiles suggest different questions and different patterns of action and intervention. The key lesson? Plot histograms of your data first to understand the broad profile before you do anything else.
In thinking about your own HR analytics challenges, I can all but guarantee three things:
- You have skewed data in many key HR metrics in your organization.
- Your organization is focusing on the averages when talking about these metrics.
- No one has plotted a histogram of these data.
Plot your data before going for the more complicated stuff. You will save yourself and your colleagues valuable time by asking the right questions from the outset and identifying where to most effectively target your time and resources.
- © 2022 HR Analytics 101