A Practical Introduction to Standard Deviations for Human Capital, Part 1
The What, the Why, and the How for HR and Other Non-technical Professionals
Overview
HR professionals usually present just averages to business leaders and other decision makers. This practice can mask meaningful differences between groups.
What’s the solution? The standard deviation, a measure that tells us how much our values are spread out from those averages and from each other. Standard deviations provide context to help us understand the means and are also informative by themselves.
Here in Part 1, we explain what the standard deviation (SD) is and why you should care. In Part 2, we apply your new knowledge step-by-step to some real salary data.
By the end of this tutorial, you will be able to:
- Explain what the standard deviation is
- Understand why it is important
- Calculate the standard deviation in R
- Apply this knowledge to human capital and other business data
- Put your skills to work AT WORK
As always, I strongly recommend following along at home using the code snippets below.
Standard Deviation: The Basics
The Intuition
To understand the standard deviation, let’s start with the average (or the mean).
Suppose we have two groups of people. In the first group, we have 3 people who have taken 11, 13, and 15 unscheduled days off of work, giving us an average of 13 days. In our second group, we have 3 people who have taken 0, 13, and 26 unscheduled days off of work, again giving us an average of 13 days.
As I have noted in previous posts, the mean is a useful measure because it summarizes all of the values we have into a single value.
We can then use that single value to compare groups.
Here then, we see that both groups have a mean of 13 days. But your intuition is telling you that these groups are different, and your intuition is right: those in the first group all have similar values while those in the second group are all over the place.
The problem is that averages tell us nothing about the spread or closeness of the values within those groups. For that we need the standard deviation.
In essence, the standard deviation tells us how close our scores are to the group average. If the standard deviation is low, the scores are generally close to the mean and therefore close to one another. If the standard deviation is high, the individual scores are quite different from the mean and from one another.
The scores from Group 1 (11, 13, and 15) are less spread out than the scores from Group 2 (0, 13, 26). Accordingly, the standard deviation for Group 1 will be less than that for Group 2.
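A quick way to check this intuition is to compare the mean and the range of each group. The sketch below is a minimal illustration in base R; the vector names group1 and group2 are made up for this example and are not used elsewhere in the tutorial.

```r
# Hypothetical vector names, used only for this illustration
group1 <- c(11, 13, 15)  # unscheduled days off, Group 1
group2 <- c(0, 13, 26)   # unscheduled days off, Group 2

# Both groups have the same mean...
mean(group1)
mean(group2)

# ...but the gap between the lowest and highest value is very different
diff(range(group1))
diff(range(group2))
```

Both means come out to 13, while the gap between the smallest and largest value is 4 days in Group 1 and 26 days in Group 2.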
Let’s now move beyond glances and intuition.
The Calculation Steps and What They Mean
Here we show you the individual steps for calculating the standard deviation. Normally, you would not be doing this by hand, but stepping through the logic by hand will help you understand what the standard deviation really means. We’ll show the sd() function in a bit.
Step 1. Calculate the mean of the scores for your group. Think of the mean as the reference point.
g1 <- data.frame(days = c(11, 13, 15)) # assigning the values to group 1
mean(g1$days)
## [1] 13
Step 2. Subtract the mean from each of the individual scores. This tells us how far each individual value is from that mean reference point.
g1$diff <- g1$days - mean(g1$days) # Step 2: get the difference
g1$diff
## [1] -2 0 2
Step 3. Square those difference scores and then add them all up. We are trying to get a total measure of all of the differences. If we just added them up without squaring them first, though, the positive and negative values would cancel out.
g1$diffsq <- g1$diff^2 # Step 3: square each of those differences
g1$diffsq
## [1] 4 0 4
tot_diff <- sum(g1$diffsq) # Step 3 (continued): add up the squared differences
tot_diff
## [1] 8
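To see why the squaring matters, it may help to add up the raw differences without squaring them. This is a small sketch that recreates the Group 1 values under hypothetical names (days and raw_diff) so that it runs on its own.

```r
days <- c(11, 13, 15)          # Group 1 values (hypothetical vector name)
raw_diff <- days - mean(days)  # differences from the mean: -2, 0, 2

sum(raw_diff)    # positives and negatives cancel out
sum(raw_diff^2)  # squaring first keeps every difference in the total
```

The first sum is 0, which would wrongly suggest there is no spread at all. The second is 8, matching the total from Step 3.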
Step 4. Divide that total by one less than the number of scores you have. Here, we have three scores so we divide our total by 2 (from 3-1 = 2). This is like getting an average of those squared difference scores, and the result is called the variance. If we had 100 values, we would divide by 99 (from 100-1 = 99).
variance <- tot_diff/2 # Step 4: 3 total values minus one, so 3-1 = 2
variance
## [1] 4
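As a cross-check on Step 4, base R's var() function also divides by one less than the number of values, so it should match the hand calculation. The sketch below recreates the Group 1 values under hypothetical names (days, n, and by_hand) so that it runs on its own.

```r
days <- c(11, 13, 15)  # Group 1 values (hypothetical vector name)
n <- length(days)      # number of scores, here 3

by_hand <- sum((days - mean(days))^2) / (n - 1)  # Steps 1 through 4
by_hand
var(days)  # built-in variance, same n - 1 divisor
```

Both lines return 4. Using length(days) - 1 rather than typing the 2 by hand means the same code works for a group of any size.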
Step 5. Take the square root of that value we obtained in Step 4. In this step we are essentially undoing that squaring business in Step 3 so we have a number that we can interpret.
stdev <- sqrt(variance) # Step 5: get the square root of the variance
stdev
## [1] 2
Our final calculated standard deviation for Group 1 is 2.
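For readers who like notation, the five steps can be written as a single formula. This is the standard formula for the sample standard deviation. The symbols are introduced here for illustration and do not appear elsewhere in the tutorial: x_i is each individual score, x-bar is the mean of the scores, and n is the number of scores.

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```

For Group 1, that works out to the square root of ((-2)^2 + 0^2 + 2^2) / (3 - 1), which is the square root of 8 / 2, or 2.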
Comparison with Group 2 Standard Deviation
Now let’s see what the standard deviation tells us when we apply these same steps to Group 2:
g2 <- data.frame(days = c(0, 13, 26)) # unscheduled days off
mean(g2$days) # Step 1: calculate the mean
## [1] 13
g2$diff <- g2$days - mean(g2$days) # Step 2: get the differences
g2$diff
## [1] -13 0 13
g2$diffsq <- g2$diff^2 # Step 3: square those differences
g2$diffsq
## [1] 169 0 169
tot_diff <- sum(g2$diffsq) # Step 3 (continued): add up the squared differences
variance <- tot_diff/2 # Step 4: 3 total scores less one, so 3-1 = 2
stdev <- sqrt(variance) # Step 5: get the square root of the variance
stdev
## [1] 13
Here we get a standard deviation of 13, not 2. This makes sense because the spread of the scores in Group 2 was much greater than the spread from Group 1. The standard deviation captures this difference.
Now that you understand what the standard deviation is, let’s explore it in R.
Calculating the Standard Deviation in R
In R, we just use the sd() function to calculate the standard deviation.
sd(g1$days)
## [1] 2
sd(g2$days)
## [1] 13
Pretty easy.
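If you want to confirm that sd() is doing the same thing as the five manual steps, one option is to wrap those steps in a small function and compare the results. The function name manual_sd and the vector names group1 and group2 are hypothetical, made up for this sketch.

```r
# Hypothetical helper that packages the five manual steps
manual_sd <- function(x) {
  diffs <- x - mean(x)                    # Steps 1 and 2: distance from the mean
  tot_diff <- sum(diffs^2)                # Step 3: square and add up
  variance <- tot_diff / (length(x) - 1)  # Step 4: divide by n - 1
  sqrt(variance)                          # Step 5: square root
}

group1 <- c(11, 13, 15)
group2 <- c(0, 13, 26)

manual_sd(group1)
manual_sd(group2)

# all.equal() checks for equality while tolerating tiny rounding differences
all.equal(manual_sd(group1), sd(group1))
all.equal(manual_sd(group2), sd(group2))
```

The manual function returns 2 and 13, and both comparisons return TRUE, so in day-to-day work you can simply rely on sd().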
Another Example with More Data
Now let’s see what the sd() function can do for us with a more realistic set of values. We’ll continue with a measure of unscheduled days off of work, but this time we will create two larger sets from scratch. Don’t worry if you don’t understand the code creating those data sets. Just copy it and run it to get the data you need for the SDs.
set.seed(102)
days_off <- data.frame(g1 = rnorm(200, mean = 15, sd = 2),
g2 = rnorm(200, mean = 15, sd = 5)) #create two sets of values, same mean different st dev
days_off <- apply(X = days_off, MARGIN = 2, FUN = round, digits = 0) # round to nearest integer
If we did the typical HR thing and only looked at the means of unscheduled days off, we would conclude that the groups are similar.
mean(days_off[,"g1"])
## [1] 15.125
mean(days_off[,"g2"])
## [1] 14.605
Simply checking the standard deviation tells us this conclusion would be wrong.
sd(days_off[,"g1"])
## [1] 2.156997
sd(days_off[,"g2"])
## [1] 4.533402
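With more than a couple of groups, it can be handy to see the mean and the standard deviation side by side. The sketch below is one way to do that in base R. It recreates the simulated data from above so that it runs on its own, and the object name summary_tbl is made up for this example. Note that it keeps days_off as a data frame rather than converting it to a matrix.

```r
set.seed(102)
days_off <- data.frame(g1 = rnorm(200, mean = 15, sd = 2),
                       g2 = rnorm(200, mean = 15, sd = 5))
days_off <- round(days_off, digits = 0)  # round to the nearest whole day

# One row per group, with the mean and SD as columns
summary_tbl <- data.frame(mean = sapply(days_off, mean),
                          sd   = sapply(days_off, sd))
summary_tbl
```

Reading across each row makes the point quickly: the mean column looks nearly identical for the two groups, while the sd column does not.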
Visualizing Differences in Standard Deviations
Histograms are another valuable tool for comparing the distributions. They help us SEE what differences in the standard deviation look like.
library(lattice)
library(reshape2)
days_off2 <- melt(days_off)
names(days_off2)[2:3] <- c("group", "days")
histogram(x = ~days | group, data = days_off2, layout = c(1,2), breaks = seq(0,50, 2))
The spread of unscheduled days off is much greater for those in group 2, exactly what those standard deviations told us. Seeing is indeed believing…but if we only saw the means we would believe the wrong thing.
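If you do not have the lattice and reshape2 packages installed, a similar picture can be sketched with base R graphics alone. This is an alternative to the code above, not the method used for the original plot. It recreates the simulated data so that it runs on its own, and the names all_days, shared_breaks, and old_par are made up for this example.

```r
set.seed(102)
days_off <- data.frame(g1 = rnorm(200, mean = 15, sd = 2),
                       g2 = rnorm(200, mean = 15, sd = 5))
days_off <- round(days_off, digits = 0)  # round to the nearest whole day

# Use the same 2-day bins for both groups so the panels are comparable
all_days <- unlist(days_off)
shared_breaks <- seq(min(all_days) - 1, max(all_days) + 2, by = 2)

old_par <- par(mfrow = c(2, 1))  # stack two plots vertically
hist(days_off$g1, breaks = shared_breaks, main = "g1", xlab = "days")
hist(days_off$g2, breaks = shared_breaks, main = "g2", xlab = "days")
par(old_par)  # restore the previous plot layout
```

Building the bins from the data, instead of hard-coding a range, avoids an error if a simulated value happens to fall outside the range you expected.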
Summary Points
Averages are crucial for understanding data, but they do not tell the whole story. Standard deviations tell us about the spread of the values and also provide important context when comparing means for different groups.
In some instances, we may find that standard deviations of two groups are quite different even when the means of the groups are similar.
Seeing that groups differ on the standard deviation, we can then start to consider relevant human capital and business questions. In the case of differences in unscheduled days off, we might ask:
- Are there meaningful differences in the work locations and commute times?
- What are the costs associated with widely variable unscheduled days off?
- What is the impact on staffing and productivity?
- Can we predict who will have unscheduled days off and when?
The take-home lesson here is that measuring the standard deviation will help us ask new and better questions than we would by looking at averages alone.
Coming Up Next…
In Part 2, we will apply our new knowledge to real salary data to see how application of standard deviation measures can play out in the real world.
Comments or Questions?
Add your comments OR just send me an email: john@hranalytics101.com
I would be happy to answer them!
- © 2023 HR Analytics 101