Tutorial: How to Create a Very Basic Recommender System
In a recent post we detailed three ways that HR, Human Capital, and learning professionals can leverage Netflix-style recommender systems to improve talent management, development, and learning processes… but you don’t need to be a machine learning expert to quickly develop and apply a very simple yet powerful recommender system of your own.
In today’s tutorial, we will introduce the k-means algorithm. The k-means algorithm is a kind of clustering algorithm that can help you quickly process heaps of data and then split people into meaningful groups.
Why Should I Care?
Segmenting our data into clusters can help us identify emergent, natural groups within our organization. This can help us target costly efforts like training, development, and retention in a way that makes use of all of our available data, not just our hunches or the last thing we remember hearing about.
The clusters you create can form the basis of a simple but impactful “recommender system” approach to key HR Analytics questions.
What You Will Learn:
- What a k-means algorithm is
- How k-means works
- How to apply one to your own HR data
Developing Intuitions: What Is the K-means Algorithm?
K-means is a very simple clustering algorithm. Clustering algorithms take data and use mathematical techniques to find groups of similar items or people in that data.
Intuitively, we use clustering all of the time. For example, suppose we are presented with a group of 5 people with the following ages: 5, 6, 17, 46, 48. If we were asked to split that group into 2 sets, most of us would split the data into children (5, 6, 17) and adults (46 and 48). We don’t call that clustering but that is exactly what we are doing: grouping like with like based on the available data.
Clustering algorithms like the k-means algorithm do the same thing, but with tons of data at massive scale. It’s an easy way to augment our little brains.
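If you want to check that intuition in R, here is a minimal sketch. The ages come from the example above; the variable names and the choice of nstart = 10 (which asks kmeans to try 10 random starting points and keep the best one) are my own assumptions for illustration.

```r
# The five ages from the example above
ages <- c(5, 6, 17, 46, 48)

set.seed(101)  # so the result is reproducible
# Ask for 2 clusters; nstart = 10 tries 10 random starts and keeps the best one
k_ages <- kmeans(ages, centers = 2, nstart = 10)

# Show each age next to its cluster label (the label numbers themselves are arbitrary)
data.frame(age = ages, cluster = k_ages$cluster)
```

The three youngest people should share one label and the two oldest the other, which is the same split most of us would make by eye.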
How Does K-Means Work? The Process in a Nutshell
I am outlining the process here so you know generally how it works. But don’t worry: R takes care of all of it with just a single line of code.
To keep things focused, let’s suppose we have 20 people in a training class, each with 2 scores on a proficiency test. We want to create 2 separate clusters based on the scores of these first tests.
Here is how the k-means algorithm would work with this setup.
- Randomly pick 2 different points that serve as our initial cluster “centers”.
- Calculate the distance of each of our 20 data points from each of the centers.
- Assign each of those data points to the cluster with the closest center location.
- Recalculate each of the “centers” of our 2 clusters using just those data points that were assigned to that cluster. This will give us two new cluster center locations.
- Recalculate the distance of each of our 20 data points from each of these two new cluster center locations.
- Reassign each of the data points to the cluster with the closest center location. Note that some of these data points will stay with the same cluster assignment while others might change.
- Keep repeating this process until none of the data points need to be reassigned after calculating the new cluster centers.
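For the curious, the steps above can be sketched by hand in a few lines of R. This is only an illustration under my own assumptions: the scores are made-up data, simple_kmeans is a hypothetical helper name, and distance is plain (squared) Euclidean distance. R's built-in kmeans function uses a more refined algorithm by default, so treat this as a picture of the idea rather than what R does internally.

```r
set.seed(42)
# Hypothetical data: 20 trainees, 2 test scores each, in two loose groups
scores <- rbind(
  cbind(rnorm(10, 60, 5), rnorm(10, 65, 5)),
  cbind(rnorm(10, 85, 5), rnorm(10, 80, 5))
)

simple_kmeans <- function(x, k, max_iter = 100) {
  # Randomly pick k of the data points as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared distance from every point to every center (one column per center)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each point to the cluster with the closest center
    new_assignment <- apply(d, 1, which.min)
    # Stop once no point needs to be reassigned
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Recalculate each center from the points currently assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

fit <- simple_kmeans(scores, k = 2)
table(fit$cluster)  # how many trainees ended up in each cluster
fit$centers         # the final center locations
```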
What The Process Looks Like
This figure shows what our k-means clustering might produce after the first clustering run.
After our final run, we see that the cluster centers have moved locations. The cluster assignments for some of our data points have also therefore changed.
Example 1: K-means with a Single Variable
Sometimes it’s useful to put people into clusters using a single variable. In this example, let’s classify employees at a company according to tenure.
One solution is to choose some arbitrary number, say, 3 years and just split the employee population into two pieces. This is simple enough but it’s arbitrary and might misrepresent our data. In the following plot we can plainly see that just grouping by the 3 year mark misses some important structure. There are other groups in there but we are treating them as one big blob.
Incidentally, this is why you should ALWAYS plot your data first before doing analyses.
Clearly, we need a better way to group our people. K-means to the rescue!
K-means R Code for Example 1
In the following, we’ll use some sample data. If you have your own data in an Excel file, just import it using the “Import Dataset” button in RStudio’s Environment pane in the upper right. Then change the variable name to the one you want to examine.
We’ll use the “kmeans” function. The first argument is the data we want to use to create our clusters. The second is simply the number of centers (clusters) we want.
That’s really all you need to do your own cluster analysis.
### Creating some example data
set.seed(101)
t1 <- runif(20, 0, 3.5)
t2 <- rnorm(20, 5, 2)
t3 <- rnorm(20, 6, 2)
t4 <- rnorm(40, 15, 3)
data <- c(t1, t2, t3, t4)

# Creating the k-means assignments
k <- kmeans(x = data, centers = 4) # x is the data, centers is the number of cluster centers
#str(k) # use the "str" function here to see all the output provided.
Let’s look at some of the output that the kmeans function provides. We can see not only how many people are in each cluster but also where the cluster centers are.
k$size # see how many values are in each cluster
## [1] 16 22 30 32
k$centers # the cluster center values that determine the assignments to a cluster
##        [,1]
## 1 18.015877
## 2 13.586993
## 3  6.402419
## 4  2.053784
Note that the actual cluster number itself does not really mean anything. It’s arbitrary. What is NOT arbitrary, however, are the clustering assignments. Like values are grouped with like values.
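If the arbitrary numbering bothers you, one option is to relabel the clusters so that 1 is the lowest-tenure group and 4 the highest. This is a small sketch under my own assumptions: it rebuilds the same kind of sample data used in this example, and the names tenure, new_label, and ordered_cluster are mine.

```r
set.seed(101)
# Sample tenure data built the same way as in this example
tenure <- c(runif(20, 0, 3.5), rnorm(20, 5, 2), rnorm(20, 6, 2), rnorm(40, 15, 3))
k <- kmeans(tenure, centers = 4)

# rank() gives each center its position from lowest (1) to highest (4)
new_label <- rank(k$centers[, 1])

# Translate every person's original label into the ordered label
ordered_cluster <- unname(new_label[k$cluster])

# Average tenure per relabeled cluster, which should now increase from 1 to 4
tapply(tenure, ordered_cluster, mean)
```

The people in each group are exactly the same as before; only the label numbers change.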
We can see what this means in practice with a histogram plot using a different color for each cluster assignment. The clustering clearly fits the data better than some arbitrary split.
Note that I am showing the plotting code here for those playing along at home. Regardless of whether you are coding or not, the key lesson is that the cluster labels provide a nice way to group people by company tenure or whatever other variable you are looking at.
library(RColorBrewer) # I like these colors more
h <- hist(data, breaks = 40, plot = FALSE) # making the histogram and saving as a variable
temp <- sort(k$centers) # getting the centers
m1 <- c(mean(temp[1:2]), mean(temp[2:3]), mean(temp[3:4])) # finding the midpoints for coloring
cuts <- cut(h$breaks, breaks = c(-Inf, m1, Inf)) # assigning the breaks of the histogram to cluster
# Plotting the histogram with some pretty colors from R Color Brewer
plot(h, col = brewer.pal(4, "Set1")[cuts], main = "Company Tenure Histogram \n Colored By Cluster")
Saving the Output to a CSV File
Now we can put our cluster assignment data together with our original data and then save it as a csv file. This is useful if we want to use our new cluster data in a program like Excel. The order of the cluster assignment data from k$cluster follows the same order as our original data.
data_all <- cbind(data, k$cluster) # binding them as columns
write.csv(x = data_all, file = "our_filename.csv")
Example 2: K-means for 2 or More Variables
In the above example, we focused on just a single measure. And k-means can indeed be used to help us identify natural groups for just one variable.
But k-means becomes really powerful when used with multiple variables. For most day-to-day HR cases, this will mean 2-5 variables, although one can go as high as a few dozen variables or a few hundred.
In this next example, we’ll use 4 variables from the famous Iris dataset that R already provides.
Our goal here is to see how well we can use k-means to discover clusters matching the species of iris.
Yes, this is about flowers, but the principles and the R code are the same.
Indeed, this highlights one of the truly powerful things about machine learning: the principles apply to all sorts of data, be they flowers, employees, or whatever.
data("iris") # load the iris data set
str(iris) # view the structure
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris) # sample first 6 rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
set.seed(101) # setting the random seed so I can reproduce the output
k_iris <- kmeans(iris[,c(1:4)], 3) # using all the rows but just the first 4 columns
Now let’s see how well our cluster splits correspond to the different species.
##
##     setosa versicolor virginica
##   1      0         48        14
##   2     50          0         0
##   3      0          2        36
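The line of code that produced the table above is not shown, so here is one plausible way to get a comparison like it. This is a sketch under my assumptions: it reruns the clustering from this example and uses R's table function to cross-tabulate the cluster labels against the species. Depending on your R version, the cluster numbers and counts you see may differ a little from the table above.

```r
data("iris")
set.seed(101)
k_iris <- kmeans(iris[, 1:4], centers = 3)

# Rows are the cluster labels, columns are the true species
species_by_cluster <- table(k_iris$cluster, iris$Species)
species_by_cluster
```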
I think this is amazing. I know nothing about flowers and yet with 2 lines of code in R, I was able to create clusters that mirror the actual species structures.
Cluster 2 categorizes perfectly and Cluster 3 nearly so. Cluster 1 is imperfect but overall, that’s pretty darn good from just 2 lines of code.
We can see what this clustering looks like by plotting just two of our dimensions.
col_cl <- brewer.pal(3, "Set1") # I hate the default colors so I am going fancy
plot(iris[c("Petal.Length", "Petal.Width")], col = col_cl[k_iris$cluster], pch = 19)
points(k_iris$centers[, c("Petal.Length", "Petal.Width")], col = col_cl, pch = 8, cex = 2)
Quick Template for Clustering with Your Own Data
Use this template to run a kmeans with your own data and then save that data as a csv file.
your_data <- read.csv("your_file_name_here") # you can use the "Import Dataset" button
vars <- c(1,2,3,4) # or whatever columns you want to include; you can also use variable names
k <- kmeans(your_data[,vars], centers = 3) # Run the cluster analysis; 3 is just an example, use the number of clusters you want
your_data_2 <- cbind(your_data, k$cluster) # add the cluster label output to your data
write.csv(x = your_data_2, file = "your_new_filename.csv") # write your data to a csv
Will K-Means Always Produce the Best Answer?
Not always, but it generally works pretty darn well.
The first step in the process is randomly choosing the initial cluster center locations. This means that sometimes the output is influenced to some degree by that random starting point.
In rare cases, the answer will be substandard relative to other possible clustering outcomes.
In practice, you should always first set your random seed as I have done here before running kmeans. This will let you always reproduce the result you get.
In addition, I find it helpful to try some different random seeds to make sure the answers I am getting are broadly consistent across different runs.
Indeed, in creating this post, I found that every so often I got poor results. The solution was to set that random seed so that I could reproduce a good run. Generally, though, kmeans is fairly robust and you should get pretty similar results with different runs regardless of the random starting point. Just be sure to cover your bases.
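Here is one way to do that check, sketched with the iris data. The choice of ten seeds and the variable names are my own assumptions. The sketch also uses the nstart argument of kmeans, which is not used elsewhere in this post: it tells kmeans to try several random starting points and keep the best result. The tot.withinss value measures how tightly the points sit around their centers, so lower is better.

```r
data("iris")
x <- iris[, 1:4]

# Run kmeans once for each of ten different seeds and record the fit
wss_by_seed <- sapply(1:10, function(s) {
  set.seed(s)
  kmeans(x, centers = 3)$tot.withinss
})
round(wss_by_seed, 2)  # broadly similar values suggest a stable answer

# Alternatively, let kmeans try 25 random starts and keep the best one
set.seed(101)
k_best <- kmeans(x, centers = 3, nstart = 25)
k_best$tot.withinss
```

If one seed gives a much larger value than the others, that run probably landed on one of the substandard answers.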
How Do I Choose the Number of Centers? A Simple Approach in 4 Steps
The simple approach to answering this begins with what you intend to actually do with the data. Clustering is great, but if you don’t begin with the end in mind, you are wasting your time.
- Ask yourself how many clusters would be useful. If you are looking at ways to split up your initial trainee course into separate sections according to early performance, you might first think about how many instructors you have. If you only have three instructors, then choosing 2-3 clusters would probably make more sense than 7 or 8. If, on the other hand, you are talking about simply understanding the structure of your workforce, a few more clusters might be useful.
- Look at the sizes of each cluster. If you have a group of 24 people but one cluster contains only 3 people, take a closer look. They might be really different and worth separating out, or you just might have too many clusters.
- Relatedly, look at the values of the cluster centers themselves and visualize. Are the differences practically meaningful? Consider test scores on a scale of 0-100%. If the cluster 1 center value is 87% and the cluster 2 center value is 89%, is that a difference that really matters? If those differences are not meaningful, reduce the number of clusters and rerun it. Please note that this method of determining how many clusters to use really only works with 1 or 2 variables. Anything more and you won’t be able to properly visualize it.
- Don’t be afraid to experiment. Try a few clusters, try a bunch. See what makes sense to you and what will be useful for those who need to act on your new insights. For typical HR analytics needs, you will get surprising impact from just a few.
There is a more complicated approach but it is beyond our current scope. In brief, it looks at the decreasing explanatory power of adding more clusters. My advice is to start small and get comfortable with kmeans first. Then you can start to dig deeper if you need to. I provide some links at the end if you want to learn more.
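To give a flavor of that more complicated approach, here is a rough sketch of what is often called an "elbow" plot. The details are my own assumptions: I use the iris data, try 1 through 8 clusters, and use tot.withinss (how much spread is left inside the clusters) as the measure. You look for the point where adding another cluster stops buying you much.

```r
data("iris")
x <- iris[, 1:4]

set.seed(101)
ks <- 1:8
# For each candidate number of clusters, record the leftover within-cluster spread
wss <- sapply(ks, function(n_clusters) {
  kmeans(x, centers = n_clusters, nstart = 10)$tot.withinss
})

# Plot the spread against the number of clusters and look for the bend
plot(ks, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```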
How Many Variables Can I Use in K-means?
As I mentioned earlier, for most HR uses you will probably only use 2-5 variables. You can, however, greatly expand this up to dozens or, in principle, even hundreds or thousands. When you get into the hundreds or thousands of variables, you might run into something called “the curse of dimensionality” but this is not really relevant for HR analytics.
Can I Use Data with Different Scales?
Yes, BUT you need to get them on the same scale first. This is called normalization.
Why do you need to do this for items on vastly different scales? K-means works by reducing the distance between the data points and the cluster centers. Variables with larger scales can swamp the calculations and dominate the cluster calculations.
For example, suppose I am clustering employees according to tenure at the company and salary. Tenure is at the scale of individual years but salary is in the tens of thousands of dollars. If we just threw it all in, the clustering would be driven by the salary variable because the numbers are just larger. Normalizing our data first takes care of this problem.
To normalize a variable (this particular recipe is also known as standardizing, or computing z-scores), just subtract the mean value for that variable from each value. Then divide those values by the standard deviation. Each normalized variable will end up with a mean of 0 and a standard deviation of 1. The cluster assignment outcomes will correspond to each case just as before.
If you are not familiar with standard deviations, click here for an earlier introductory post.
Here is some sample code if you need to normalize a variable.
a <- runif(20, 0, 100) # Use your own variable here.
norm_a <- (a - mean(a)) / sd(a) # normalizing that data
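When you have several variables, R's scale function does the same subtract-the-mean, divide-by-the-standard-deviation step for every column at once. Here is a sketch using the tenure and salary example from above. The employee data is made up, and the column names and the choice of 3 clusters are my assumptions for illustration.

```r
set.seed(101)
# Hypothetical employee data: tenure in years, salary in dollars
employees <- data.frame(
  tenure = runif(60, 0, 20),
  salary = rnorm(60, 70000, 15000)
)

# scale() normalizes every column to mean 0 and standard deviation 1
scaled <- scale(employees)
round(colMeans(scaled), 10)  # both means are (essentially) 0
apply(scaled, 2, sd)         # both standard deviations are 1

# Cluster on the normalized values, then attach the labels to the original data
k_emp <- kmeans(scaled, centers = 3, nstart = 10)
employees$cluster <- k_emp$cluster
head(employees)
```

Because the labels are attached back to the original data frame, you can still read the results in years and dollars.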
We all benefit from recommender systems every day. Ironically, we are missing many opportunities to leverage these same basic systems in HR, Human Capital Analytics, and learning-related areas right under our noses; see this recent post for discussion.
You don’t need something super complicated to get started. You can do it yourself, right now in your organization.
You just need a clear question and TWO LINES OF CODE to run a kmeans.
I am big on the 80-20 principle when it comes to analytics.
K-means analyses pack a punch for simplicity and impact.
- Here is a great interactive tool from our friends at RStudio to help us learn more about kmeans.
- If you REALLY want to dig into the math, check out this Wikipedia page.
- Need an analytically rigorous approach to selecting the number of clusters? If you have some statistics background and solid R chops, this Stack Overflow post provides a good start.
- © 2023 HR Analytics 101