Efficient Data Mining Model for Employees Churn Prediction and Safety Measure

Employee churn, also called employee turnover, is a costly problem for organizations. The true cost of replacing an employee is often very large. In this work, we aim to understand why and when employees are most likely to leave a company, i.e., the probability that an active employee leaves the organization and the key factors behind that departure. For this purpose, we created a dataset containing the attributes that are useful for predicting the factors responsible for an employee leaving a company. The attributes are satisfaction level, last evaluation, number of projects, average monthly hours, time spent in the company, whether the employee left the company, promotions in the last 5 years, department, and salary. The dataset contains 603 samples. The analysis is also useful for helping the company keep its employees safe and secure so that it can retain them over the long term. We applied several machine learning models, namely a Logistic Regression Classifier, a Random Forest Classifier, and an SVM, to check whether the dataset yields accurate predictions and which model predicts best. Of these models, the Random Forest Classifier gives the highest accuracy, about 97.2%. The Random Forest Classifier also correctly identifies the factors responsible for an employee leaving the company.

ISSN: 00333077, www.psychologyandeducation.net
We use machine learning models to predict the factors responsible for an employee leaving the company and to measure the accuracy of those predictions.
Finally, this project can give the company input on which steps to take to retain employees and on which employees are at high risk of leaving.

2. Literature Survey
Employee churn can be defined as the leak or departure of intellectual capital from the company [2]. This kind of departure is referred to as voluntary turnover, which earlier studies have analyzed [3]. High employee turnover has a severe effect on the organization. It is difficult to replace employees with others of the same skill and talent, and turnover also disrupts ongoing work and lowers the productivity of the remaining employees. Acquiring new employees carries its own costs, such as hiring and training costs [8]. Organizations can address this problem by applying machine learning algorithms to predict employee churn and then taking the actions needed to retain employees [13][14][15][16][17][18][19][20][21][22][23][24][25][26].

3. Proposed Methodology
We start from the problem statement and create a dataset for the project by collecting the data. The data is stored as a CSV (comma-separated values) file. Once created, the data is passed to data preprocessing.
We carry out the project in two phases. In the first phase, we apply basic methods: Data Visualization, Cluster Analysis, and Correlation Analysis. These methods let us draw conclusions about which factors are responsible for an employee leaving the company.
In the second phase, having identified those factors, we check how accurately churn can be predicted from them. To do so, we first split the data into a training set and a testing set, with 70% of the data used for training and the remaining 30% for testing. We then measure accuracy with three methods: Logistic Regression Classifier, Random Forest Classifier, and SVM (Support Vector Machine) Classifier. Comparing the accuracy of the three, the Random Forest Classifier gives the best accuracy, so we declare it the best model.
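The second phase can be sketched as follows. This is a minimal illustration using scikit-learn, not the exact code used in this work: the data is a synthetic stand-in generated with `make_classification` (only the sample count of 603 and the 70/30 split come from the text), and all hyperparameters are assumed defaults, so the accuracies it prints will not match the reported ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the employee dataset (assumption: the real CSV is
# not available here). 603 rows matches the sample count given in the abstract.
X, y = make_classification(
    n_samples=603, n_features=8, n_informative=5, random_state=0
)

# 70% of the data for training and 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The three classifiers compared in this work (hyperparameters are assumptions).
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
}

# Fit each model on the training set and score it on the held-out test set.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

best_model = max(accuracies, key=accuracies.get)
for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
print("Best model on this synthetic data:", best_model)
```

Scoring every model on the same held-out 30% keeps the comparison fair, since no model is evaluated on rows it has already seen during training.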

Data Pre-processing
Datasets in any data mining application can have missing values. These missing values can arise from a lack of communication among the parameters in a data collection system. Missing values can degrade the performance of a data mining system, so they should not be overlooked.
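A minimal sketch of how missing values might be detected and filled with pandas is shown below. The column names and the four rows are hypothetical, and the fill strategy (median for numeric columns, most frequent value for the categorical column) is an assumption for illustration; the text does not state which strategy was used.

```python
import io

import pandas as pd

# Tiny illustrative CSV with gaps (hypothetical column names and values).
csv_text = """satisfaction_level,last_evaluation,number_project,salary
0.38,0.53,2,low
,0.86,5,medium
0.11,,7,
0.72,0.87,5,low
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column before doing anything else.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Assumed strategy: fill numeric gaps with the column median ...
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# ... and fill the categorical column with its most frequent value.
df["salary"] = df["salary"].fillna(df["salary"].mode()[0])

print(df)
```

Counting the gaps first makes it visible how much of each attribute would be imputed, which matters on a dataset of only a few hundred rows.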

Phase 1:
The methods that we applied in phase 1 are Data Visualization, Cluster Analysis, and Correlation Analysis.

Data Visualization:
Data visualization presents data through visual elements such as charts,

Correlation Analysis:
A correlation is a number between -1 and +1 that measures the degree of association between two attributes (call them X and Y).
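As an illustration, the sketch below computes the Pearson correlation coefficient, one common measure with this range; the text does not say which coefficient was used, so this choice is an assumption. The two short lists are made-up values standing in for X (satisfaction level) and Y (whether the employee left).

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y over the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of X from its mean
    dy = y - y.mean()  # deviations of Y from its mean
    return float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))


# Made-up example values (not taken from the dataset).
satisfaction = [0.9, 0.8, 0.4, 0.3, 0.1]
left = [0, 0, 1, 1, 1]

r = pearson_r(satisfaction, left)
print(f"correlation between satisfaction and leaving: {r:.3f}")
```

A value near -1 here would mean that lower satisfaction goes together with leaving, a value near +1 the opposite, and a value near 0 no linear association.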

Cluster Analysis:
Clustering is an example of unsupervised learning.
By analyzing the relationships between data points, we group together the points that are similar to one another.

Fig. 7. Cluster analysis
Using cluster analysis for our employee churn prediction project, we draw the following conclusion. There are three categories of employees:
Employees with high satisfaction and high performance.
Employees with low satisfaction and high performance.
Employees with low satisfaction and low performance.
In our project we use K-means clustering with k = 3, applied to the satisfaction level of employees who left the company.
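K-means with k = 3 can be sketched with scikit-learn as below. The points are synthetic: three groups are generated around assumed centers that mimic the three categories, and the last evaluation score is used as a stand-in for performance, which is an assumption about how performance was measured.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Assumed (satisfaction, evaluation) centers for the three categories:
# high/high, low satisfaction with high performance, and low/low.
centers = np.array([[0.8, 0.9], [0.1, 0.85], [0.4, 0.5]])

# 50 synthetic "employees who left" scattered tightly around each center.
points = np.vstack(
    [center + rng.normal(scale=0.03, size=(50, 2)) for center in centers]
)

# K-means with k = 3, as stated in the text.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_

print("cluster centers (satisfaction, evaluation):")
print(kmeans.cluster_centers_.round(2))
print("cluster sizes:", np.bincount(labels))
```

On real data the groups would not be this cleanly separated, so the fitted cluster centers should be inspected before attaching a label such as "low satisfaction, high performance" to a cluster.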
These are the methods we applied in phase 1.
By applying Data Visualization, Correlation Analysis, and Cluster Analysis, we identify the factors that are responsible for an employee leaving the company.
With the factors identified in phase 1, we move on to phase 2.

Phase 2:
In phase 1 we identified the factors; in phase 2 we check how accurate each model is and which model is the best. First, we need to split the data into a training set and a testing set.

Training and Testing a model:
Training and testing means that we send part of our data to a training phase and the rest to a testing phase, so that the model is tested on data it was not trained on. We give the larger share of the data to the training phase and the smaller share to the testing phase.
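A minimal sketch of such a split with scikit-learn's `train_test_split`, using the 70/30 ratio given in the methodology. The feature matrix and labels are placeholders, and stratifying on the label is an added assumption (it keeps the share of leavers similar in both sets) rather than something stated in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 603 rows: two dummy features per employee and a
# dummy label where roughly one employee in four has left.
X = np.arange(603 * 2).reshape(603, 2)
y = (np.arange(603) % 4 == 0).astype(int)

# Larger share (70%) for training, smaller share (30%) for testing.
# stratify=y is an assumption: it preserves the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print("training rows:", len(X_train))
print("testing rows:", len(X_test))
print("share of leavers (train, test):", y_train.mean(), y_test.mean())
```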

Logistic Regression Classifier:
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. With the input and output prepared, we create and define our classification model. We can then inspect the parameters of the Logistic Regression Classifier, which tell us about the nature and behavior of the model.
The accuracy obtained with the Logistic Regression Classifier is 0.773.
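The steps for this classifier might look like the sketch below, assuming scikit-learn's `LogisticRegression`. The data is synthetic and `max_iter=1000` is an assumed setting, so the printed accuracy will differ from the 0.773 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared input (X) and output (y).
X, y = make_classification(n_samples=603, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Create and define the classification model, then fit it.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Inspect the model's parameters, e.g. regularization strength C.
params = clf.get_params()
print("C:", params["C"], "max_iter:", params["max_iter"])

# One learned coefficient per input feature.
print("coefficients shape:", clf.coef_.shape)

# Mean accuracy on the held-out test set.
accuracy = clf.score(X_test, y_test)
print(f"accuracy: {accuracy:.3f}")
```

`get_params()` returns the settings the model was configured with, while `coef_` holds what it learned; the sign and size of each coefficient indicate how an attribute pushes the prediction toward or away from leaving.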

Random Forest Classifier:
A random forest is a meta estimator that fits various