Tutorial: Using R to Create Team-level Measures from Individual-level HR data
People live and work in groups. Yet most of our HR data analytics work overlooks the contribution of team-level elements to individual and leader succcess (or failure). In today’s tutorial I will show you three different methods for getting team-level data from traditional “one person, one row” HR data.
The Setup
Let’s suppose we want to explore the relationship between team-level experience and individual level performance. To do this, we need a way to create team-level data from the rows of traditional HR data. Trying to do this in Excel would be daunting but in R its a snap.
The Data
To keep things simple, let’s create a very simply dataset that will let you see how the different functions work.
emp_id <- 1:12 # employee number
leader <- c(rep("Pickett", 4), rep("Sledge", 4), rep("Franklin", 4))
yrs <- c(1,2, 3, 24, 3, 4, 5, 6, 5, 6, 12, 12) #years experience
d <- data.frame(emp_id, leader, yrs)
Team Data with the Aggregate Function
The aggregate is a foundational tool for HR Analytics. It works by splitting one variable up according to the values of some other variable. Then it applies a function (such as the mean) to each of those pieces separately.
This is something that you might do in an Excel pivot table but the ability to smoothly integrate it other aspects of our analysis in R makes it really powerful.
Writing the Aggregate Statement
In the current case, we are “modeling” (that is, breaking down) our our variable for years of experience (yrs) as a function of leader. Then, we get the mean of that years variable within each of the separate teams. Note that here we are using a formula format and specifying the dataset.
team_avg <- aggregate(yrs ~ leader, data = d, FUN = mean)
team_avg
## leader yrs
## 1 Franklin 8.75
## 2 Pickett 7.50
## 3 Sledge 4.50
Merging the Aggregate Result
Now that we have our team-level years of experience value, we want to add that back to our original data. We’ll first rename the “yrs” variable from our team data so we know what it represents.
Then we’ll merge our original data with the team data, matching on the “leader” variable.
names(team_avg)[2] <- "team_yrs"
new_d <- merge(d, team_avg, by = "leader")
new_d # It works!
## leader emp_id yrs team_yrs
## 1 Franklin 9 5 8.75
## 2 Franklin 10 6 8.75
## 3 Franklin 11 12 8.75
## 4 Franklin 12 12 8.75
## 5 Pickett 1 1 7.50
## 6 Pickett 2 2 7.50
## 7 Pickett 3 3 7.50
## 8 Pickett 4 24 7.50
## 9 Sledge 5 3 4.50
## 10 Sledge 6 4 4.50
## 11 Sledge 7 5 4.50
## 12 Sledge 8 6 4.50
Team Data with tapply
Writing the tapply statement
Another approach to getting team level data is with the tapply function. We can’t use the formula approach with tapply so here the X is the variable we want chop up and operate on. We’ll split the data up again by leader, specified this time as the INDEX argument.
team_avg <- tapply(X = d$yrs, INDEX = d$leader, FUN = mean)
team_avg # It works!
## Franklin Pickett Sledge
## 8.75 7.50 4.50
Reshaping tapply output
The data look good but there is no “leader” field here. That will be a problem later when we try to merge it with our original data.
To fix this we’ll use the melt function from the reshape2 library.
# install.packages("reshape2") # Run this if you don't have the library installed
library(reshape2) # load the library
## Warning: package 'reshape2' was built under R version 3.2.5
team_avg <- melt(team_avg) # melt the data into a "long" format
team_avg # It works!
## Var1 value
## 1 Franklin 8.75
## 2 Pickett 7.50
## 3 Sledge 4.50
No we just need change the names of the variables
# Note: I did this in 2 separate steps so it's clear what's going on
names(team_avg)[1] <- "leader"
names(team_avg)[2] <- "team_yrs"
Merging the tapply data
Time to merge the team-level data with the original. We can use the same process as before.
new_d <- merge(d, team_avg, by = "leader")
new_d # It works!
## leader emp_id yrs team_yrs
## 1 Franklin 9 5 8.75
## 2 Franklin 10 6 8.75
## 3 Franklin 11 12 8.75
## 4 Franklin 12 12 8.75
## 5 Pickett 1 1 7.50
## 6 Pickett 2 2 7.50
## 7 Pickett 3 3 7.50
## 8 Pickett 4 24 7.50
## 9 Sledge 5 3 4.50
## 10 Sledge 6 4 4.50
## 11 Sledge 7 5 4.50
## 12 Sledge 8 6 4.50
Team Data with merge and ddply
The aggregate and tapply functions both work well for creating team-level data.
But one thing that you might care about is filtering out a person so that the team-level data is about one’s COLLEAGUES and does not include the data for that given person.
Consider for example, those under Franklin. While they have an average of 7.5 years of experience we can see that one person has 24 years. If want to know about the experience level of one’s colleagues, it we might want to correct for that.
This will require a slightly more complicated process. In a more realistic setting with larger teams, this might not matter that much. Still, I wanted to show you how to handle it in case you are a stickler for details.
There are two basic steps. First, we’ll merge the data with itself so we can associate each employee with each member of his/ her team. Then, well use the ddply function to calculate the mean years of experience for each employees colleagues (that is, excluding them from the calculations for their teams)
Step 1: Merging the Data…with itself
I am not sure if this is an elegant solution or not, but it work and it’s simple.
We’ll merge the data with itself. This will give us a row of data for each person with each person who is on their team.
d2 <- merge(d, d, by = "leader")
# examining first rows
# notice how each person is paired with someone on their team in each line
head(d2, 16)
## leader emp_id.x yrs.x emp_id.y yrs.y
## 1 Franklin 9 5 9 5
## 2 Franklin 9 5 10 6
## 3 Franklin 9 5 11 12
## 4 Franklin 9 5 12 12
## 5 Franklin 10 6 9 5
## 6 Franklin 10 6 10 6
## 7 Franklin 10 6 11 12
## 8 Franklin 10 6 12 12
## 9 Franklin 11 12 9 5
## 10 Franklin 11 12 10 6
## 11 Franklin 11 12 11 12
## 12 Franklin 11 12 12 12
## 13 Franklin 12 12 9 5
## 14 Franklin 12 12 10 6
## 15 Franklin 12 12 11 12
## 16 Franklin 12 12 12 12
So far so good. Now we just need to drop the lines where employees are paired with themselves. We’ll do this by subsetting our data to only those cases where the emp_id within a single line don’t match. That way we’ll end up with three colleagues for each person (as it should be for this dataset)
d2 <- subset(d2, emp_id.x != emp_id.y)
head(d2, 16) # the self-matches are gone!
## leader emp_id.x yrs.x emp_id.y yrs.y
## 2 Franklin 9 5 10 6
## 3 Franklin 9 5 11 12
## 4 Franklin 9 5 12 12
## 5 Franklin 10 6 9 5
## 7 Franklin 10 6 11 12
## 8 Franklin 10 6 12 12
## 9 Franklin 11 12 9 5
## 10 Franklin 11 12 10 6
## 12 Franklin 11 12 12 12
## 13 Franklin 12 12 9 5
## 14 Franklin 12 12 10 6
## 15 Franklin 12 12 11 12
## 18 Pickett 1 1 2 2
## 19 Pickett 1 1 3 3
## 20 Pickett 1 1 4 24
## 21 Pickett 2 2 1 1
Step 2: Using the ddply function
ddply comes from the plyr package. I’m going to just touch on the basics so you know that it exists and what it does, but if you want to go into more details of this powerful package check out this post.
In essence the ddply function take a dataframe, splits the data up by some set of variables, and runs a function each of the pieces, and the then returns a dataframe result. It’s called “dd” because it takes and returns a dataframe.
In the present case, we will split the our data by individual employee, calculate the mean years of experience for each of the colleagues, and then return the result.
You can achieve this same ends using the aggregate function actually, but I think it’s important that you get to know some other functions too.
# install.packages("plyr") # run this if you don't the library
library(plyr)
## Warning: package 'plyr' was built under R version 3.2.5
d3 <- ddply(.data = d2, .variables = "emp_id.x", summarise, mean(yrs.y))
head(d3) # the data looks good
## emp_id.x ..1
## 1 1 9.666667
## 2 2 9.333333
## 3 3 9.000000
## 4 4 2.000000
## 5 5 5.000000
## 6 6 4.666667
Now let’s fix those variable name and merge our results, this time by the individual employee.
names(d3)[1:2] <- c("emp_id", "coll_avg")
d3 # much better
## emp_id coll_avg
## 1 1 9.666667
## 2 2 9.333333
## 3 3 9.000000
## 4 4 2.000000
## 5 5 5.000000
## 6 6 4.666667
## 7 7 4.333333
## 8 8 4.000000
## 9 9 10.000000
## 10 10 9.666667
## 11 11 7.666667
## 12 12 7.666667
d <- merge(d, d3, by = "emp_id")
d # it works!
## emp_id leader yrs coll_avg
## 1 1 Pickett 1 9.666667
## 2 2 Pickett 2 9.333333
## 3 3 Pickett 3 9.000000
## 4 4 Pickett 24 2.000000
## 5 5 Sledge 3 5.000000
## 6 6 Sledge 4 4.666667
## 7 7 Sledge 5 4.333333
## 8 8 Sledge 6 4.000000
## 9 9 Franklin 5 10.000000
## 10 10 Franklin 6 9.666667
## 11 11 Franklin 12 7.666667
## 12 12 Franklin 12 7.666667
Notice the big change for employee #4. Previously, all those of experience were included as part of his/her the team average. Now that we are just measuring the experience of ones’ colleagues, we can see the other members of the team are extremely junior.
Conclusion
Remembering the role of context in performance and prediction is a big part of quality HR analytics. Using functions like aggregate, tapply, and ddply as a we have here gives you the tools you need to make happen with very little work.
Like this post?
Get our FREE Turnover Mini Course!
You’ll get 5 insight-rich daily lessons delivered right to your inbox.
In this series you’ll discover:
- How to calculate this critical HR metric
- How turnover can actually be a GOOD thing for your organization
- How to develop your own LEADING INDICATORS
- Other insightful workforce metrics to use today
There’s a bunch more too. All free. All digestible. Right to your inbox.
Yes! Sign Me Up!
Comments or Questions?
Add your comments OR just send me an email: john@hranalytics101.com
I would be happy to answer them!
Contact Us
- © 2023 HR Analytics 101
- Privacy Policy