HR Analytics Project EEB

HR Analytics: Why are our best and most experienced employees leaving prematurely?

Erik Bebernes

Introduction

This project uses a dataset I found on Kaggle, where a company has been experiencing difficulty retaining its best and most experienced employees. The data frame consists of 14,999 observations of 10 variables, which are:

names(hr)
 [1] "satisfaction_level"    "last_evaluation"       "number_project"
 [4] "average_montly_hours"  "time_spend_company"    "Work_accident"
 [7] "left"                  "promotion_last_5years" "sales"
[10] "salary"

Satisfaction Level: employee’s overall job satisfaction level based on a survey
Last Evaluation: employee’s performance score given by their manager
Number of projects: how many projects an employee has been involved in
Average monthly hours: mean hours worked by the employee per month
Time spend company: years the employee has worked for the company
Work accident: binary variable; 1 indicates the employee has had an accident in the workplace
Left: binary variable; 1 indicates the employee has left, 0 that the employee is still at the company
Promotion last 5 years: binary variable signaling if the employee has been promoted in the last five years
Sales: categorical variable on job type
Salary: categorical variable (low, medium, high) of how much the employee is paid annually

My approach to this project can be summarized in the following steps:

1.) Clean and structure the data set, including imputing missing values if necessary
2.) Create subsets of the best employees who left and the best employees who stayed
3.) Create discrete factor variables and perform association rules analysis
4.) Classify employees through decision tree analysis
5.) Find any significant correlations, and differences in correlations between said subsets
6.) Exploratory visualization analysis in an attempt to explain any discrepancies in correlations
7.) Run a random forest algorithm to confirm significant relationships between the variables, as well as a logistic regression
8.) Provide conclusions and recommendations for management

HR_comma_sep <- read.csv("~/Downloads/HR_comma_sep.csv", header = TRUE)
View(HR_comma_sep)
hr <- HR_comma_sep

Cleaning and structuring the dataset

At first glance the dataset seems clean, but to make sure I’m going to use the “Amelia” package to identify any missingness.

library(Amelia)
missmap(hr)
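As a package-free cross-check on the missingness map, the number of missing values per column can also be counted with base R. The sketch below is only an illustration: the data frame `toy` and its values are made up, not rows from the Kaggle file. On the real data the equivalent call would be `colSums(is.na(hr))`.

```r
# Toy data frame with two deliberately missing values (illustrative only)
toy <- data.frame(
  satisfaction_level = c(0.38, NA, 0.11, 0.72),
  number_project     = c(2L, 5L, NA, 5L),
  salary             = c("low", "medium", "medium", "low")
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column contains a missing value
no_missing <- all(na_counts == 0)
print(no_missing)
```

A vector of zeros from `colSums(is.na(...))` would agree with a missingness map that shows no gaps.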

The missingness map shows that there is no missing data.

> str(hr)
'data.frame':	14999 obs. of  10 variables:
 $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
 $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
 $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
 $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
 $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
 $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
 $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

Subsets

# column names must match str(hr): last_evaluation and left
hrbestleft <- hr[which(hr$last_evaluation > .72 & hr$left == 1), ]  # employees with high evaluations who left the company
hrbeststay <- hr[which(hr$last_evaluation > .72 & hr$left == 0), ]  # employees with high evaluations who stayed with the company

Creating Discrete Variables and Association Rules Analysis

quantile(hr$average_montly_hours, .33)
quantile(hr$average_montly_hours, .67)
# cutoffs below were typed in by hand; check them against the quantile() output above
hr$Hours_Discrete[hr$average_montly_hours <= 69] <- 'low'
hr$Hours_Discrete[hr$average_montly_hours > 69 & hr$average_montly_hours < 134] <- 'average'
hr$Hours_Discrete[hr$average_montly_hours >= 134] <- 'high'

quantile(hr$satisfaction_level, .33)
quantile(hr$satisfaction_level, .67)
quantile(hr$satisfaction_level, .8)
# NOTE: satisfaction_level is on a 0-1 scale (see str(hr)), so cutoffs of 43 and 68
# label every row 'low'; .43 and .68 were probably intended
hr$Sat_Discrete[hr$satisfaction_level <= 43] <- 'low'
hr$Sat_Discrete[hr$satisfaction_level > 43 & hr$satisfaction_level < 68] <- 'average'
hr$Sat_Discrete[hr$satisfaction_level >= 68] <- 'high'

library(arules)
# apriori() needs categorical columns, so convert the binary and binned variables to factors
hr$Work_accident <- as.factor(hr$Work_accident)
hr$left <- as.factor(hr$left)
hr$promotion_last_5years <- as.factor(hr$promotion_last_5years)
hr$Hours_Discrete <- as.factor(hr$Hours_Discrete)
hr$Sat_Discrete <- as.factor(hr$Sat_Discrete)
names(hr)
hrassoc <- hr[, c(6:12)]
rules <- apriori(hrassoc, parameter = list(support = .2, confidence = .7))
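The two ideas in this step, tercile binning and the support and confidence thresholds passed to apriori(), can be sketched by hand in base R. This is a minimal illustration under stated assumptions: the data frame `toy`, its values, and the column name `hours_bin` are invented for the example, and the cut points come from quantile() rather than being typed in.

```r
# Toy data (made up for illustration; not the Kaggle file)
toy <- data.frame(
  hours = c(100, 120, 140, 150, 160, 200, 220, 240, 260, 280, 300, 310),
  left  = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1)
)

# Tercile cut points taken from the data itself
breaks <- quantile(toy$hours, probs = c(0, 1/3, 2/3, 1))
toy$hours_bin <- cut(toy$hours, breaks = breaks,
                     labels = c("low", "average", "high"),
                     include.lowest = TRUE)
print(table(toy$hours_bin))

# Metrics for the rule {hours_bin = high} => {left = 1}
lhs <- toy$hours_bin == "high"
rhs <- toy$left == 1
support    <- mean(lhs & rhs)            # share of all rows matching both sides
confidence <- sum(lhs & rhs) / sum(lhs)  # share of lhs rows that also match rhs
lift       <- confidence / mean(rhs)     # confidence relative to the base rate of rhs
print(c(support = support, confidence = confidence, lift = lift))
```

A lift of 1 means the left-hand side does not change how often the right-hand side occurs, which is what happens whenever the right-hand side is true for every row.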

# since the majority of employees haven't left, it will be a good idea to reduce support and increase confidence
rules <- apriori(hrassoc, parameter = list(support = .05, confidence = .95))
# still not getting any interesting rules, so I'll make a new dataset with only left = 1
hrleft <- hr[which(hrassoc$left == 1), ]
hrleft <- hrleft[, c(6:12)]
rules <- apriori(hrleft, parameter = list(support = .3, confidence = 1))
inspect(rules)

     lhs                                         rhs                  support   confidence lift
[1]  {}                                       => {left=1}             1.0000000 1          1
[2]  {}                                       => {Sat_Discrete=low}   1.0000000 1          1
[3]  {salary=medium}                          => {left=1}             0.3688043 1          1
[4]  {salary=medium}                          => {Sat_Discrete=low}   0.3688043 1          1
[5]  {salary=low}                             => {left=1}             0.6082330 1          1
[6]  {salary=low}                             => {Sat_Discrete=low}   0.6082330 1          1
[7]  {Hours_Discrete=high}                    => {left=1}             0.9106693 1          1
[8]  {Hours_Discrete=high}                    => {Sat_Discrete=low}   0.9106693 1          1
[9]  {Work_accident=0}                        => {left=1}             0.9526743 1          1
[10] {Work_accident=0}                        => {Sat_Discrete=low}   0.9526743 1          1
[11] {promotion_last_5years=0}                => {left=1}             0.9946794 1          1
[12] {promotion_last_5years=0}                => {Sat_Discrete=low}   0.9946794 1          1
[13] {left=1}                                 => {Sat_Discrete=low}   1.0000000 1          1
[14] {Sat_Discrete=low}                       => {left=1}             1.0000000 1          1
[15] {salary=medium, Hours_Discrete=high}     => {left=1}             0.3385606 1          1
[16] {salary=medium, Hours_Discrete=high}     => {Sat_Discrete=low}   0.3385606 1          1
[17] {Work_accident=0, salary=medium}         => {left=1}             0.3480818 1          1
[18] {Work_accident=0, salary=medium}         => {Sat_Discrete=low}   0.3480818 1          1
[19] {promotion_last_5years=0, salary=medium} => {left=1}             0.3674041 1          1
[20] {promotion_last_5years=0, salary=medium} => {Sat_Discrete=low}   0.3674041 1          1
[21] {left=1, salary=medium}                  => {Sat_Discrete=low}   0.3688043 1          1
[22] {salary=medium, Sat_Discrete=low}        => {left=1}             0.3688043 1          1
[23] {salary=low, Hours_Discrete=high}        => {left=1}             0.5527863 1          1
[24] {salary=low, Hours_Discrete=high}        => {Sat_Discrete=low}   0.5527863 1          1
[25] {Work_accident=0, salary=low}            => {left=1}             0.5816298 1          1
[26] {Work_accident=0, salary=low}            => {Sat_Discrete=low}   0.5816298 1          1
[27] {promotion_last_5years=0, salary=low}    => {left=1}             0.6043125 1          1
[28] {promotion_last_5years=0, salary=low}    => {Sat_Discrete=low}   0.6043125 1          1
[29] {left=1, salary=low}                     => {Sat_Discrete=low}   0.6082330 1          1
[30] {salary=low, Sat_Discrete=low}           => {left=1}             0.6082330 1          1

Most interesting rules (of the people who left):
1.) 99% never received a promotion
2.) 95% never had an accident
3.) 60% were low salary
4.) 100% had low job satisfaction

These rules point to a few relationships between the variables that may explain why some employees are leaving. Of the employees who left, 99% had not been promoted in the last five years, 95% never had an accident, and 60% were low salary. The 100% low-satisfaction figure needs caution: satisfaction_level is measured on a 0 to 1 scale while the Sat_Discrete cutoffs were 43 and 68, so every employee is labelled low and the rule cannot separate leavers from stayers (its lift is 1). Satisfaction may still matter in determining leaving vs. staying, so next I’m going to look at correlations between satisfaction and the numeric variables.

Correlation Analysis

Using all employees in the dataset:

cor(hr[, 1:5])
                     satisfaction_level last_evaluation number_project average_montly_hours time_spend_company
satisfaction_level           1.00000000       0.1050212     -0.1429696          -0.02004811         -0.1008661
last_evaluation              0.10502121       1.0000000      0.3493326           0.33974180          0.1315907
number_project              -0.14296959       0.3493326      1.0000000           0.41721063          0.1967859
average_montly_hours        -0.02004811       0.3397418      0.4172106           1.00000000          0.1277549
time_spend_company          -0.10086607       0.1315907      0.1967859           0.12775491          1.0000000

The above plot and output show correlations between the numeric variables for all employees. Managers seem to give higher evaluation scores to employees who work more hours and who have more projects; however, there is a negative correlation between employee satisfaction and number of projects. It should be interesting to see how this compares to correlations using just the best employees.

Correlations using just the best and most experienced employees that left:

> hrbestleft <- hr[which(hr$last_evaluation >= .72 & hr$left == 1), ]
> cor(hrbestleft[, 1:5])

                     satisfaction_level last_evaluation number_project average_montly_hours time_spend_company
satisfaction_level            1.0000000       0.3611564     -0.7370609           -0.4771749          0.6582700
last_evaluation               0.3611564       1.0000000     -0.2150533           -0.1261519          0.3147566
number_project               -0.7370609      -0.2150533      1.0000000            0.5217016         -0.3644283
average_montly_hours         -0.4771749      -0.1261519      0.5217016            1.0000000         -0.1572702
time_spend_company            0.6582700       0.3147566     -0.3644283           -0.1572702          1.0000000
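One way to compare a subset's correlations with those of the whole workforce is to subtract one correlation matrix from the other. The sketch below runs on synthetic data only: the groups `all_emp` and `best_left`, their sizes, and the simulated relationship are assumptions chosen to show the mechanics, not estimates from the HR dataset.

```r
set.seed(1)  # synthetic data, for illustration only

# Hypothetical "everyone" group: hours and satisfaction unrelated
all_emp <- data.frame(hours = rnorm(200, mean = 200, sd = 40))
all_emp$satisfaction <- rnorm(200, mean = 0.6, sd = 0.2)

# Hypothetical "best who left" group: satisfaction falls as hours rise
best_left <- data.frame(hours = rnorm(200, mean = 250, sd = 30))
best_left$satisfaction <- 0.9 - 0.002 * best_left$hours + rnorm(200, sd = 0.03)

cor_all  <- cor(all_emp)
cor_best <- cor(best_left)

# Element-wise difference: large absolute values mark relationships
# that behave differently in the subset than in the full group
cor_diff <- cor_best - cor_all
print(round(cor_diff, 2))
```

With the real data, the same subtraction applied to `cor(hrbestleft[, 1:5])` and `cor(hr[, 1:5])` would put the largest shifts side by side in a single matrix.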

There are some very notable differences between the two correlation matrices, including the strong negative correlation between number of projects and satisfaction level (-0.74) and the sizable negative correlation between average monthly hours and satisfaction level (-0.48). This suggests that managers are overworking their best employees, which leads to lower satisfaction levels. It’s worth looking at the data visually to see if this is in fact the case. I’ll also run a decision tree analysis, which may serve as a confirmation.

Interpreting Correlation Differences Visually

Do the best employees work more hours?

Comparing these histograms, it’s clear that employees who score higher on manager evaluations are working considerably more hours than the workforce as a whole.

Do the best employees work on more projects?

Yes, the best employees usually have more projects. For the workforce as a whole, the number of employees trends downward as the number of projects increases, and almost the opposite can be said for the best employees (until you get to 6 projects).

Have the best employees been working at the company for a longer period of time?

Almost all of the best employees have been at the company for at least four years; perhaps this is related to “learning by doing.” It’s also a sufficient amount of time to prove to managers that they are high performing. The dataset as a whole shows that there is an abundance of employees who have been there for 2 and 3 years. Let’s see if anyone is being promoted.

As you can see above, hardly any of the best-performing employees have been promoted in the last five years. In fact, it’s only 0.2%. It must be discouraging to these employees to be highly evaluated and not be rewarded for it.

Next I’m going to look at the relationship between job type and salary. Are there noticeable differences in pay between different departments of the company? And how many employees are in each department?

A couple of things I noticed while looking at this graph are that a majority of the good employees are on the low end of the salary spectrum and that most of them are working in sales, support, and technical roles. However, I made the same graph using the dataset as a whole and didn’t see much of a difference, so I’ll put these observations aside for now.

As I mentioned earlier during my association rules analysis, satisfaction is most likely significant in determining why the best employees are leaving. The plot below is an attempt to see that relationship visually, where the green density is the subset of the best employees that left, the red density is the best employees that have stayed, and the blue density is the entire dataset.

library(ggplot2)
# overlay the three satisfaction densities: best who left (green), best who stayed (red), everyone (blue)
p1 <- ggplot() +
  geom_density(data = hrbestleft, aes(satisfaction_level), fill = 'green', alpha = .3) +
  geom_density(data = hrbeststay, aes(satisfaction_level), fill = 'red', alpha = .3) +
  geom_density(data = hr, aes(satisfaction_level), fill = 'blue', alpha = .3) +
  theme_light(base_size = 16) +
  xlab("Satisfaction Level") + ylab("") +
  ggtitle("Satisfaction Levels of Subsets")

The subset of best employees that left (green) is what really stands out here. Many of them have very low satisfaction levels (<.25), then there is a lull, and then there is another group with satisfaction levels greater than .6. It’s difficult to say why this might be. Perhaps there is a difference in how the employees interpret satisfaction. It’s possible that they still enjoyed their job despite being overworked and not being promoted. I think the best way to figure this out is through a decision tree analysis, where those who left will be classified more accurately.

But first, I want to combine average monthly hours and satisfaction into a plot. Since I noticed earlier that the good employees that left were working a lot more hours, there should be a strong relationship between the two.

# alpha is a fixed setting, so it belongs in geom_point() rather than inside aes()
plot6 <- ggplot(hr, aes(satisfaction_level, average_montly_hours, color = left)) +
  geom_point(alpha = .3) +
  ggtitle("Hours and Satisfaction")

These distributions are very tight, which tells me that the decision tree will be a great addition to my analysis. The blue box is probably underperforming employees, those who have not been working many hours and aren’t that satisfied. The other two blue clusters, judging by the density plots shown earlier, are high-performing employees. My next plot is another confirmation of that hypothesis, but this time I’m adding years spent at the company.

The cluster on the right has a lot of employees who have been at the company for a long time; I think the lack of promotions may have something to do with them leaving.

Decision Tree Analysis

Small decision trees are the easiest to interpret, so in order to get a few simple rules (and to avoid over-fitting the model) I trained the tree on a small sample of the data. The sample is about 6% of the rows rather than 2%, because sample() rescales the weights 0.02 and 0.3, which puts roughly 0.02 / 0.32 of the rows in the training set.

# install.packages("party")  # run once if the package is not installed
library(party)
set.seed(421)
ind <- sample(2, nrow(hr), replace = TRUE, prob = c(0.02, 0.3))
traindata <- hr[ind == 1, ]
testdata <- hr[ind == 2, ]
form <- left ~ satisfaction_level + average_montly_hours + time_spend_company + last_evaluation
hrtree <- ctree(form, data = traindata, controls = ctree_control(maxsurrogate = 3))
table(predict(hrtree), traindata$left)  # predicted vs. actual on the training sample
plot(hrtree, type = "simple")
print(hrtree)
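The size of the training sample follows from how sample() treats its prob argument: the weights do not have to sum to one and are rescaled. A minimal sketch, assuming a hypothetical row count of 15,000 (the real data frame has 14,999 rows):

```r
# Weights as used in the sampling call; sample() normalises them to sum to 1
weights <- c(0.02, 0.3)
expected_share <- weights / sum(weights)
print(round(expected_share, 4))  # expected share of rows in groups 1 and 2

# Empirical check on a hypothetical row count of 15,000
set.seed(421)
ind <- sample(2, 15000, replace = TRUE, prob = weights)
observed_share <- mean(ind == 1)
print(observed_share)
```

If a 2% training sample were wanted instead, weights that already sum to one, such as `prob = c(0.02, 0.98)`, would give it.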

Using the variables time spent at company, satisfaction, average monthly hours, and last evaluation (which I think are the most important variables based on the visualizations I made), I was able to come up with a few rules that help classify employees into the leaving and staying categories. Here are my key takeaways:

1.) Employees with low satisfaction levels who haven’t been at the company long (2 years or less) will generally stay.

[Figure: conditional inference tree from plot(hrtree, type = "simple"). For each terminal node, y = (share with left = 0, share with left = 1).]

1) satisfaction_level, p < 0.001
   ≤ 0.46:
      2) time_spend_company, p < 0.001
         ≤ 4:
            3) time_spend_company, p = 0.001
               ≤ 2:  4) n = 21, y = (0.952, 0.048)
               > 2:  5) n = 217, y = (0.258, 0.742)
         > 4:  6) n = 46, y = (0.891, 0.109)
   > 0.46:
      7) time_spend_company, p < 0.001
         ≤ 4:  8) n = 562, y = (0.984, 0.016)
         > 4:
            9) last_evaluation, p < 0.001
               ≤ 0.8:  10) n = 61, y = (0.951, 0.049)
               > 0.8:
                  11) average_montly_hours, p < 0.001
                      ≤ 216:  12) n = 18, y = (1, 0)
                      > 216:
                         13) time_spend_company, p = 0.001
                             ≤ 5:  14) n = 37, y = (0.081, 0.919)
                             > 5:  15) n = 22, y = (0.273, 0.727)

2.) Employees with low satisfaction levels who have been at the company for more than 2 but no more than 4 years leave.

3.) Employees with high satisfaction levels who have been at the company for 4 years or less stay.

4.) High-performing employees with high satisfaction who have been at the company for more than 4 years leave when they are working too many hours (more than 216 per month in the tree).
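The accuracy of the tree on its training sample can be recomputed by hand from the terminal nodes. The sketch below copies the node sizes and majority-class shares from the tree figure and assumes each node predicts its majority class; the vector names are illustrative.

```r
# Terminal nodes read from the tree figure: size and share of the majority class
node_n     <- c(21, 217, 46, 562, 61, 18, 37, 22)
node_share <- c(0.952, 0.742, 0.891, 0.984, 0.951, 1, 0.919, 0.727)

# Rows classified correctly if each node predicts its majority class
correct  <- sum(node_n * node_share)
accuracy <- correct / sum(node_n)
print(sum(node_n))         # size of the training sample
print(round(accuracy, 3))  # share of training rows classified correctly
```

Because the node shares are rounded to three decimals in the figure, the result is approximate.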

This analysis is 91.5% accurate on the training sample, which is pretty good considering how simple the tree is. If I were to show management one graph it would be this one: it identifies clear-cut patterns and confirms much of what I had been hypothesizing with my previous analyses.

Random Forest and Logistic Regression

Before offering my final advice to management, I want to see how accurately I can predict who is going to leave. An accurate machine learning algorithm will allow the company to focus on specific employees, perhaps offering them a raise or reducing their hours before they decide to leave. First I’m going to try a logistic regression, which estimates the probability of a binary dependent variable for each observation. Any employee with a predicted probability greater than .5 is classified as leaving. Let’s see how it goes:

Logistic Regression:

library(dplyr)
# creating a test and training set
set.seed(142)
train_rows <- sample(nrow(hr), floor(.7 * nrow(hr)))  # index-based split, so it does not rely on row names
train <- hr[train_rows, ]
test <- hr[-train_rows, ]
# the call that fits glmmodel is not shown in the original; this one is an assumption
# that reuses the predictors of the random forest model
glmmodel <- glm(left ~ satisfaction_level + number_project + average_montly_hours + time_spend_company + promotion_last_5years + last_evaluation, data = train, family = binomial)
fitted.results <- predict(glmmodel, newdata = test, type = "response")  # type = "response" converts logits to predicted probabilities
new <- mutate(test, fitted.results)
predicted.to.leave <- filter(new, fitted.results > .5)
predicted.to.stay <- filter(new, fitted.results <= .5)
summary(predicted.to.stay$left)
summary(predicted.to.leave$left)

The model ended up being only 79.4% accurate, which is okay, but considering the decision tree was 91% accurate, I think I can come up with a better model. Random forest works by combining the votes of many decision trees and can work very well. Let’s try that:

randindex <- sample(1:dim(hr)[1])
cutpoint2_3 <- floor(2 * dim(hr)[1] / 3)
traindata <- hr[randindex[1:cutpoint2_3], ]
testdata <- hr[randindex[(cutpoint2_3 + 1):dim(hr)[1]], ]
library(randomForest)
rfmodel <- randomForest(factor(left) ~ satisfaction_level + number_project + average_montly_hours + time_spend_company + promotion_last_5years + last_evaluation, data = traindata)
plot9 <- plot(rfmodel, ylim = c(0, 0.36))

The error rates for both classes are very low, which is a good sign. Let’s see how accurate the model is when I try it on a test set.

library(caret)  # provides confusionMatrix()
prediction <- predict(rfmodel, testdata)
confusionMatrix(prediction, testdata$left)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 3786   48
         1   10 1156

               Accuracy : 0.9884
                 95% CI : (0.985, 0.9912)
    No Information Rate : 0.7592
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.9679
 Mcnemar's Test P-Value : 1.184e-06
            Sensitivity : 0.9974
            Specificity : 0.9601
         Pos Pred Value : 0.9875
         Neg Pred Value : 0.9914
             Prevalence : 0.7592
         Detection Rate : 0.7572
   Detection Prevalence : 0.7668
      Balanced Accuracy : 0.9787
       'Positive' Class : 0

The model is 98.84% accurate on the test set, which will prove to be very beneficial in identifying employees who are likely to leave in the future.

What variables are most important in leaving vs. staying?

importance(rfmodel)
                      MeanDecreaseGini
satisfaction_level         1226.048093
number_project              665.390311
average_montly_hours        536.922188
time_spend_company          664.193153
promotion_last_5years         4.487941
last_evaluation             430.694068

According to the random forest model, satisfaction, number of projects, and time spent at the company are the three most important variables.
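The headline statistics in the confusion matrix output can be reproduced from its four counts. The sketch below copies the counts from the output and recomputes accuracy, sensitivity, and specificity by hand, treating class 0 (stayed) as the positive class as the output states; the object names are illustrative.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(3786, 10, 48, 1156), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)           # correct predictions over all test rows
sensitivity <- cm["0", "0"] / sum(cm[, "0"])     # stayers correctly predicted to stay
specificity <- cm["1", "1"] / sum(cm[, "1"])     # leavers correctly predicted to leave
missed_leavers <- cm["0", "1"]                   # leavers the model predicted would stay

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
print(missed_leavers)
```

Since the positive class is 0, the figure that matters most for retention is the specificity: it is the share of actual leavers the model flags, and the off-diagonal count in the top row is the number of leavers it misses.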

Conclusion and Recommendations

I very much enjoyed learning more about this dataset. I performed so many types of analyses because retaining a company’s best employees is extremely important. High turnover is costly, and a company that wants to grow needs the right people leading the way. I’ve worked for organizations in the past that have had high turnover rates, and while you want underperforming employees to leave, you want your best workers to grow with you.

What I found most useful in this project were the visualizations, the decision tree, and the random forest algorithm. They can all be used in different ways. If management wants a basic understanding of what’s going on, I would show them the visuals; if they want to know what patterns are harming them, I would go over the decision tree; and if they want to know which employees will leave in the future, the random forest model would be helpful. Based on all of those, here are the two key points management should know concerning why their best and most experienced employees are leaving prematurely:

1.) They are being overworked: it’s common for managers to take advantage of employees who do a good job by giving them a heavier workload. This is costing the company, because those employees are deciding to leave.

2.) They aren’t being promoted: good employees expect to be rewarded. There is a large group of employees with high satisfaction levels who have been at the company for more than four years, but they decided to leave because there isn’t any career growth.

There are a couple of simple, obvious actions management can take. Managers shouldn’t work their best employees more than anyone else, and they should promote those employees after 3 or 4 years. In time, I think they will find that although the company may be less productive in the short run, reducing the turnover rate of its best employees will lead to incremental growth.
