When Models Lie: Why Plotting Your Data Is Super-Duper Critical
In HR analytics, a plot is worth a thousand models.
Today’s post is based on the work of F.J. Anscombe and it screams one simple thing: always plot your data first.
The Data
We’ll start by loading the Anscombe dataset that comes with R.
# library(help = "datasets") # for a complete list of other datasets available
data(anscombe)
The Models All Say the Same Thing…
Next, we’ll do the very thing that you should NEVER do: run models without first plotting your data.
For this, we’ll just use the handy code included in the help file for anscombe. I’ve pasted it below with a few slight modifications. Note: If you find this R code hard to follow, skip to the end of this post for a simpler, step-by-step version aimed at those just starting out in R.
# help(anscombe) # for the basic code I used here.
# looping through and creating a set of regression models
# I have provided simpler code at the end if you are new to R
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  # swap the i-th x and y variables into the generic formula
  ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
  ## or ff[[2]] <- as.name(paste0("y", i))
  ##    ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
  print(anova(lmi))
  print(paste("r squared:", round(summary(lmi)$r.squared, 3)))
}
## Analysis of Variance Table
##
## Response: y1
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 27.510 27.5100 17.99 0.00217 **
## Residuals 9 13.763 1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "r squared: 0.667"
## Analysis of Variance Table
##
## Response: y2
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 27.500 27.5000 17.966 0.002179 **
## Residuals 9 13.776 1.5307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "r squared: 0.666"
## Analysis of Variance Table
##
## Response: y3
## Df Sum Sq Mean Sq F value Pr(>F)
## x3 1 27.470 27.4700 17.972 0.002176 **
## Residuals 9 13.756 1.5285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "r squared: 0.666"
## Analysis of Variance Table
##
## Response: y4
## Df Sum Sq Mean Sq F value Pr(>F)
## x4 1 27.490 27.4900 18.003 0.002165 **
## Residuals 9 13.742 1.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "r squared: 0.667"
If you are familiar with regression, you’ll notice that the results across our four linear regression models are all virtually identical.
If you are not, just observe that all the F-values and p-values are essentially the same, and they all signal a statistically significant relationship between x and y. Moreover, the R-squared value tells us that x accounts for roughly two-thirds of the variation in y. That's REALLY good.
The intercepts and slope coefficients are also virtually identical, which means the fitted regression lines themselves are essentially the same.
More evidence that there is nothing to see here.
## See how close they are (numerically!)
sapply(mods, coef)
## lm1 lm2 lm3 lm4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1 0.5000909 0.500000 0.4997273 0.4999091
lapply(mods, function(fm) coef(summary(fm)))
## $lm1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0000909 1.1247468 2.667348 0.025734051
## x1 0.5000909 0.1179055 4.241455 0.002169629
##
## $lm2
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000909 1.1253024 2.666758 0.025758941
## x2 0.500000 0.1179637 4.238590 0.002178816
##
## $lm3
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0024545 1.1244812 2.670080 0.025619109
## x3 0.4997273 0.1178777 4.239372 0.002176305
##
## $lm4
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017273 1.1239211 2.670763 0.025590425
## x4 0.4999091 0.1178189 4.243028 0.002164602
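As a quick sanity check on those R-squared figures, recall that in a one-predictor regression R-squared is simply the squared correlation between x and y. Here is a minimal sketch using only base R (no refitting required); it should reproduce the values printed earlier, roughly 0.667, 0.666, 0.666, and 0.667:

# r squared in simple regression is just the squared correlation between x and y
sapply(1:4, function(i) {
  round(cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])^2, 3)
})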
By the quick-and-dirty modeling approach often used in HR analytics and elsewhere, we have four more or less identical linear regression models, each with a significant predictor and a respectable R-squared.
…But a Plot is Worth a Thousand Models
Lo and behold! When we actually PLOT the data, we see an entirely different picture.
## Now, do what you should have done in the first place: PLOTS
par(mfrow = c(2, 2), mar = 0.1 + c(4, 4, 1, 1), oma = c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
If we just did our models and trusted the numbers by themselves, we would’ve walked away feeling pretty good.
But while the model results capture the essence of the first plot, they fail to communicate the real story for the others.
Whether it's the curvilinear relationship in plot 2, the single outlier pulling the line in plot 3, or the lone extreme x value driving the entire fit in plot 4, the pictures tell the true tale.
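If you want numbers to back up what the pictures are telling you, standard regression diagnostics flag the same problems. The lines below are just an illustrative sketch (they reuse the mods list fitted above), not a substitute for looking at the plots:

# data set 2: adding a quadratic term captures the curvature a straight line misses
summary(lm(y2 ~ x2 + I(x2^2), data = anscombe))
# data set 3: one observation stands out with a very large Cook's distance
round(cooks.distance(mods[[3]]), 2)
# data set 4: the single point at x4 = 19 has leverage 1, so it determines the slope all by itself
round(hatvalues(mods[[4]]), 2)

Even so, none of these checks replaces the plots; they only confirm what a two-second look at the pictures already told you.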
Summary
There is no reason to belabor the point. Whether you are just reporting basic means for monthly turnover, producing some bar graphs summarizing employee movement, or using models to predict future talent sources, it's CRITICAL that you plot your data first.
Reference
Anscombe's original paper: Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17–21. http://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf
Appendix: Simpler Code Modeling and Plotting
Sometimes R users like to write compact code that is elegant but harder to understand. The code below does the same job as the code above, just piece by piece. Working piece by piece is usually the best way to learn R, or any other language for that matter. You can always move on to more compact code later (although I usually don't).
data(anscombe) # loading data
a <- anscombe # putting into a for easier reference
names(a) # just so you can see the names
## [1] "x1" "x2" "x3" "x4" "y1" "y2" "y3" "y4"
# Four Separate Models and Summaries
mod1 <- lm(y1 ~ x1, data = a)
summary(mod1)
##
## Call:
## lm(formula = y1 ~ x1, data = a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
mod2 <- lm(y2 ~ x2, data = a)
summary(mod2)
##
## Call:
## lm(formula = y2 ~ x2, data = a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## x2 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
mod3 <- lm(y3 ~ x3, data = a)
summary(mod3)
##
## Call:
## lm(formula = y3 ~ x3, data = a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
mod4 <- lm(y4 ~ x4, data = a)
summary(mod4)
##
## Call:
## lm(formula = y4 ~ x4, data = a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## x4 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
# Four Separate Plots
# setting up 4 plots on a single screen and the margins
par(mfrow = c(2, 2), mar = 0.1 + c(4, 4, 1, 1), oma = c(0, 0, 2, 0))
# Plot 1
plot(a$x1, a$y1, col = "red", pch = 21, bg = "orange", cex = 1.2,
     xlim = c(3, 19), ylim = c(3, 13)) # plotting the data
abline(mod1) # adding the regression model line
# Plot 2
plot(a$x2, a$y2, col = "red", pch = 21, bg = "orange", cex = 1.2,
     xlim = c(3, 19), ylim = c(3, 13)) # plotting the data
abline(mod2) # adding the regression model line
# Plot 3
plot(a$x3, a$y3, col = "red", pch = 21, bg = "orange", cex = 1.2,
     xlim = c(3, 19), ylim = c(3, 13)) # plotting the data
abline(mod3) # adding the regression model line
# Plot 4
plot(a$x4, a$y4, col = "red", pch = 21, bg = "orange", cex = 1.2,
     xlim = c(3, 19), ylim = c(3, 13)) # plotting the data
abline(mod4) # adding the regression model line
# Adding text at the top
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)