Explore how data science can be used to predict employee churn in this data science project presentation, allowing organizations to proactively address retention issues. This student presentation from Boston Institute of Analytics showcases the methodology, insights, and implications of predicting employee turnover. Visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more data science insights.
Predicting Employee Churn: A Data-Driven Approach Project Presentation
1. NAME: POOJA SHAH
Date of Assignment: 18/11/23
Date of Submission: 11/12/23
Project 2
Title: EMPLOYEE CHURN PREDICTION
2. Project Aim
To determine whether an employee will churn, and to estimate the loss incurred if they do.
To create a system that helps prevent such churn and supports the long-term stability of our company.
This capstone project aims to uncover the factors that lead to employee attrition and to explore related questions by developing an employee churn prediction system.
3. Overview of Project
Predicting employee churn involves using machine
learning models to forecast whether an employee is
likely to leave a company in the near future. This is
a crucial task for organizations as it allows them to
take preventive measures such as improving work
conditions, offering incentives, or providing career
development opportunities to retain valuable
employees.
4. Project Contents
• Problem Formulation
• Data collection
• Importing libraries, loading and understanding the data
• Exploratory Data Analysis
• Data Preprocessing
• Data Visualization
• Graphs Analysis
• Checking imbalance in dataset
• Balancing the data using SMOTE
• Feature Scaling
• Feature Extraction using PCA
• Model building & Evaluation
  • Logistic Regression
  • KNN
  • Decision Tree Classifier
  • Random Forest
  • AdaBoost
  • Support Vector Classifier
• Comparing different models
• Conclusion
5. Importing libraries, loading and understanding the data
• We will be using the following libraries:
1) Pandas
2) Numpy
3) Seaborn
4) Matplotlib.pyplot
8. Exploratory Data Analysis
• df.shape -
shape is an attribute, not a method. It returns a (rows, columns) tuple giving the
overall number of rows and columns in the data.
9. Exploratory Data Analysis
df.isnull() - creates a DataFrame of the
same shape as df, where each entry is True
if the corresponding element in df is NaN
(null), and False otherwise.
.sum() then calculates the sum of True
values along each column, resulting in a
Series that contains the total number of
missing values for each column.
.to_frame() converts the Series into a
DataFrame.
.rename(columns={0:"Total No. of
Missing Values"}) renames the column
containing the total number of missing
values to "Total No. of Missing Values."
missing_data["% of Missing Values"] =
df.isnull().mean()*100:
df.isnull().mean() calculates the proportion
of missing values for each column by taking
the mean (average) of the Boolean values in
the DataFrame. This gives a fraction between
0 and 1 for each column.
*100 is then used to convert the proportions
into percentages.
The result is assigned to a new column in
the missing_data DataFrame called "% of
Missing Values."
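The chain described above can be put together as the following minimal sketch. The toy DataFrame and its values are hypothetical stand-ins for the project dataset; only the method chain comes from the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the project dataset.
df = pd.DataFrame({
    "Age": [41, 49, np.nan, 33],
    "MonthlyIncome": [5993.0, np.nan, np.nan, 2909.0],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Count missing values per column and store them in a one-column DataFrame.
missing_data = (
    df.isnull()
      .sum()
      .to_frame()
      .rename(columns={0: "Total No. of Missing Values"})
)

# The mean of the Boolean mask is the fraction missing; *100 gives a percentage.
missing_data["% of Missing Values"] = df.isnull().mean() * 100

print(missing_data)
```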
11. Exploratory Data Analysis
• df.duplicated()
This method returns a Boolean Series marking each row that repeats an earlier row.
• df.duplicated().mean()*100
This gives the percentage of rows that are duplicates.
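A small illustration of the duplicate check, assuming a hypothetical four-row DataFrame in which one row repeats an earlier one:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first.
df = pd.DataFrame({
    "Age": [41, 49, 41, 33],
    "Department": ["Sales", "R&D", "Sales", "R&D"],
})

duplicate_mask = df.duplicated()              # True for repeats of an earlier row
duplicate_count = int(duplicate_mask.sum())   # number of duplicate rows
duplicate_pct = duplicate_mask.mean() * 100   # share of duplicate rows in percent

print(duplicate_count, duplicate_pct)
```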
12. Exploratory Data Analysis
• column_data_types = df.dtypes:
df.dtypes returns a Series containing the data
type of each column in the DataFrame.
Counting numerical and categorical
columns:
This loop iterates through each column in the
DataFrame and checks its data type.
• np.issubdtype(data_type, np.number)
checks if the data type is a numerical type. If
true, it increments numerical_count; otherwise,
it increments categorical_count.
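The counting loop described above might look like the sketch below. The column names and values are hypothetical, and the text columns are created with an explicit object dtype as an assumption, so that np.issubdtype receives a plain NumPy dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; text columns use object dtype explicitly (assumption).
df = pd.DataFrame({
    "Age": pd.Series([41, 49, 37]),
    "MonthlyIncome": pd.Series([5993.0, 5130.0, 2090.0]),
    "Department": pd.Series(["Sales", "R&D", "R&D"], dtype=object),
    "OverTime": pd.Series(["Yes", "No", "Yes"], dtype=object),
})

column_data_types = df.dtypes

numerical_count = 0
categorical_count = 0
for column_name, data_type in column_data_types.items():
    # Integer and float dtypes are subtypes of np.number; object is not.
    if np.issubdtype(data_type, np.number):
        numerical_count += 1
    else:
        categorical_count += 1

print(numerical_count, categorical_count)
```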
13. • df.describe().T –
• describe() generates descriptive statistics of the DataFrame's
numeric columns.
• .T is the transpose operation. It switches the rows
and columns of the result obtained from describe().
• Count: The number of non-null values in each
column.
• Mean: The average value of each column.
• Standard Deviation (std): It indicates how much individual
data points deviate from the mean.
• Minimum (min): The smallest value in each column.
• 25th Percentile (25%): Also known as the first quartile, it's
the value below which 25% of the data falls.
• Median (50%): Also known as the second quartile, it's the
middle value when the data is sorted. It is a measure of
central tendency.
• 75th Percentile (75%): Also known as the third quartile,
it's the value below which 75% of the data falls.
• Maximum (max): The largest value in each column
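As a quick illustration of the summary table, here is a sketch on a hypothetical two-column DataFrame. After transposing, each row is a feature and each column is one of the statistics listed above.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric columns of the dataset.
df = pd.DataFrame({
    "Age": [30, 40, 50, 60],
    "YearsAtCompany": [2, 4, 6, 8],
})

# One row per feature, one column per statistic.
summary = df.describe().T

print(summary)
```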
14. Pre-Processing
• df.rename(columns={"Attrition": "Employee_Churn"}, inplace=True)
Uses the pandas rename method to rename the "Attrition" column to
"Employee_Churn" in place.
• df.drop(columns=["Over18", "EmployeeCount", "EmployeeNumber", "StandardHours"], inplace=True)
Removes the columns "Over18", "EmployeeCount", "EmployeeNumber", and
"StandardHours" from the DataFrame (df).
• df.columns
Returns the names of all remaining columns.
15. Pre-Processing
The code prints the names of the categorical columns and the numerical columns in the DataFrame
to the console. This information is helpful for later analysis, preprocessing, or
visualization tasks that need to handle the two types of data separately.
16. Pre-Processing
This code is a common approach for identifying and handling outliers in a dataset using the
IQR method, and it also provides visualizations to assess the impact of the outlier handling
process. It ensures that extreme outliers do not unduly affect the analysis of the data.
The result is a grid of boxplots, where each subplot corresponds to a numerical column in the
DataFrame. This visualization is useful for understanding the distribution and variability of
values in each numerical feature.
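The slide does not show how the outliers were handled, so the following is only a sketch under the assumption that values outside the usual 1.5 × IQR fences are capped rather than dropped. The function name cap_outliers_iqr and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series.clip(lower=lower, upper=upper)

# Hypothetical column with one extreme value.
income = pd.Series([10, 12, 11, 13, 12, 100], name="MonthlyIncome")
capped = cap_outliers_iqr(income)

print(capped.tolist())
```

Drawing a boxplot of each column before and after this step is one way to check the effect of the capping.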
20. VISUALISATION – UNIVARIATE ANALYSIS – count plot & Pie Chart sub plot
• The result is a figure containing a count plot and a pie chart, both illustrating employee
churn in terms of counts and percentages, respectively. The count plot shows the
distribution of churn and non-churn instances, while the pie chart provides a visual
representation of the churn rate as a percentage.
22. VISUALISATION – BIVARIATE ANALYSIS – count plot
• Bivariate analysis is a
statistical analysis
technique that involves the
examination of the
relationship between two
variables. It is often used to
understand how one
variable affects or is related
to another variable.
• We then create count plots
for 2 categorical variables
26. VISUALISATION – BIVARIATE ANALYSIS – Hist Plot
• The provided code defines a function named hist_plot that creates a histogram with a kernel
density estimate (KDE) for a specified column in a DataFrame (df).
• plt.show() is used to display all the created plots.
• Each histogram provides a visual representation of the distribution of the specified
numerical columns, and the bars are colored based on whether an employee has churned or
not (as indicated by the 'Employee_Churn' column). This allows for a quick comparison of
the distributions for employees who have churned versus those who haven't in terms of age,
monthly income, and years at the company.
28. VISUALISATION – MULTIVARIATE ANALYSIS – scatter plot
• Scatter plots are used to visualize
the relationship between two
continuous variables.
• Each data point is plotted on a
graph, with one variable on the x-
axis and the other on the y-axis.
• This helps you visualize patterns,
trends, and potential correlations
between the two variables.
29. REPLACE
• df['Employee_Churn']:
This selects the 'Employee_Churn' column in the DataFrame df.
• .replace({'No': 0, 'Yes': 1}):
This method replaces values in the specified column according to
the provided dictionary. In this case, it replaces 'No' with 0 and 'Yes'
with 1.
30. LABEL ENCODER
• This code defines a function
named labelencoder that uses
scikit-learn's LabelEncoder to
encode categorical columns in a
pandas DataFrame into numerical
values.
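A minimal sketch of such a helper is shown below. The slide names the function labelencoder but does not show its body, so the signature, the copy-then-encode behaviour, and the sample data are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def labelencoder(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with the given categorical columns integer-encoded."""
    encoded = df.copy()
    for column in columns:
        # A fresh encoder per column; classes are numbered in sorted order.
        encoded[column] = LabelEncoder().fit_transform(encoded[column])
    return encoded

# Hypothetical toy data.
df = pd.DataFrame({
    "Department": ["Sales", "R&D", "HR", "R&D"],
    "OverTime": ["Yes", "No", "No", "Yes"],
})
encoded_df = labelencoder(df, ["Department", "OverTime"])

print(encoded_df)
```

Label encoding assigns arbitrary integer codes, so the numeric order of the codes carries no meaning for nominal categories such as department names.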
31. FEATURE SELECTION
This code is a useful way to visualize the pairwise correlations between
features in your dataset. It helps identify relationships between variables
and can be valuable for feature selection and understanding the underlying
structure of your data.
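The correlation code itself is not shown on the slide. As a sketch, the matrix behind such a plot can be computed with pandas; the column names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical numeric features.
df = pd.DataFrame({
    "Age": [25, 35, 45, 55],
    "TotalWorkingYears": [2, 10, 20, 30],
    "DistanceFromHome": [9, 1, 7, 3],
})

# Pairwise Pearson correlations between all numeric columns.
correlation_matrix = df.corr()

print(correlation_matrix.round(2))
```

A heatmap of correlation_matrix (for example with seaborn's heatmap function) is a common way to display it. Pairs of features with a correlation close to 1 or -1 carry largely redundant information.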
32. Checking For Imbalance In Dataset
The code creates a pie chart
to visually represent the imbalanced
data. The two slices
represent the "Churn" and "Not
Churn" classes, drawn with different
colors and offset (exploded) slices
to highlight the imbalance.
The percentage of each class
is displayed on the chart, and a
legend is added for clarity.
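The percentages shown on such a chart can be computed directly before plotting. This sketch uses a hypothetical 80/20 split, not the project's actual class distribution.

```python
import pandas as pd

# Hypothetical target column: 8 employees stayed (0), 2 churned (1).
churn = pd.Series([0] * 8 + [1] * 2, name="Employee_Churn")

class_counts = churn.value_counts()                  # absolute counts per class
class_pct = churn.value_counts(normalize=True) * 100 # percentage per class
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts.to_dict(), class_pct.to_dict(), imbalance_ratio)
```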
33. Balancing The Data using SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is applied to the
training data to generate synthetic samples for the minority class. The
sampling_strategy parameter controls which class is resampled and to
what ratio.
This way, you can address class imbalance in your dataset and
create a balanced training set for your machine learning models.
We split our data before using SMOTE.
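To make the idea of "synthetic samples" concrete, here is a simplified NumPy illustration of SMOTE's core step: a new point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is not the imbalanced-learn implementation (there the usual route is the SMOTE class and its fit_resample method). The function name, parameters, and data are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances among minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per point

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))             # random minority sample
        nb = neighbours[base, rng.integers(k)]      # one of its k neighbours
        gap = rng.random()                          # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class samples with two features.
X_minority = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 1.4], [1.2, 0.9]])
X_new = smote_like_oversample(X_minority, n_new=6)

print(X_new.shape)
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies.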
34. The bar plot
provides a visual
representation of
the balanced or
adjusted
distribution of
classes in the
target variable
after SMOTE.
35. Standardization is a feature scaling technique commonly used in machine learning
to bring all features or variables to a similar scale.
This process helps algorithms perform better by ensuring that no single feature dominates the
learning process due to its larger magnitude.
Standardization is particularly important for algorithms that rely on distances or gradients,
such as k-nearest neighbors.
The goal of standardization is to transform the features so that they have a mean of 0 and a
standard deviation of 1.
This transformation does not change the shape of the distribution of the data; it only shifts
and rescales the data to make it more suitable for modeling.
36. FEATURE SCALING
The purpose of standardization is to transform the features so that they have a mean of 0 and a
standard deviation of 1. This is important, especially for algorithms that rely on distance
measures, as it ensures that all features contribute equally to the computations.
In this case, the features in x_sampled are standardized using the StandardScaler, and the result
is stored in the DataFrame standard_df. Each column in standard_df now represents a
standardized version of the corresponding feature in the original dataset.
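The transformation is z = (x - mean) / standard deviation, applied column by column. The sketch below writes it out with pandas instead of calling StandardScaler, using the population standard deviation (ddof=0), which is the convention StandardScaler follows. The values in x_sampled are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the resampled feature matrix.
x_sampled = pd.DataFrame({
    "Age": [25.0, 35.0, 45.0, 55.0],
    "MonthlyIncome": [2000.0, 4000.0, 6000.0, 8000.0],
})

# z = (x - mean) / std for every column, with the population std (ddof=0).
standard_df = (x_sampled - x_sampled.mean()) / x_sampled.std(ddof=0)

print(standard_df.round(3))
```

After scaling, both columns have the same values even though their original units differ by two orders of magnitude, which is why neither can dominate a distance calculation.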
37. PCA – PRINCIPAL COMPONENT ANALYSIS
PCA stands for Principal Component Analysis. It is a dimensionality reduction technique
commonly used in machine learning and statistics.
The main goal of PCA is to transform high-dimensional data into a new coordinate system,
capturing the most important information while minimizing information loss.
PCA achieves this by finding a set of orthogonal axes (principal components) along which
the data varies the most.
38. FEATURE EXTRACTION USING PCA
39. KEY STEPS IN PCA
Standardization: Standardize the features (subtract the mean and divide by the standard
deviation) to ensure that all features have a similar scale.
Covariance Matrix: Compute the covariance matrix for the standardized data. The covariance
matrix represents the relationships between pairs of features.
Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix. This
yields a set of eigenvalues and corresponding eigenvectors.
Principal Components: The eigenvectors represent the principal components. These are the
directions in feature space along which the data varies the most. The corresponding eigenvalues
indicate the amount of variance captured by each principal component.
Projection: Project the original data onto the new coordinate system defined by the principal
components. This results in a reduced-dimensional representation of the data.
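The five steps above can be traced in a few lines of NumPy. This is a sketch on randomly generated data, not the project's code; the number of samples, features, and retained components are arbitrary assumptions.

```python
import numpy as np

# Hypothetical data: 200 samples, 3 features, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]

# 1. Standardization: zero mean and unit standard deviation per feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features.
cov = np.cov(Z, rowvar=False)

# 3. Eigenvalue decomposition (eigh is meant for symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Principal components: eigenvectors sorted by decreasing eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# 5. Projection onto the first two principal components.
X_pca = Z @ eigenvectors[:, :2]

print(explained_variance_ratio.round(3), X_pca.shape)
```

The explained variance ratio shows how much of the total variance each component captures, which is the usual basis for deciding how many components to keep.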
41. TRAIN TEST SPLIT
By splitting your data into training and testing sets, you can use X_train and y_train to train
your machine learning model and then use X_test and y_test to evaluate its performance.
This is a common practice to assess how well your model generalizes to unseen data.
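A minimal sketch of the split with scikit-learn's train_test_split follows. The test_size, stratify, and random_state values are assumptions (the slide does not state them), and the arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels: 20 samples, 2 features.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 14 + [1] * 6)

# Hold out 25% of the rows for testing; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```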
42. MODEL BUILDING, CLASSIFICATION
REPORT & EVALUATION
• We will now build the following models:
• Logistic Regression
• K-Nearest Neighbors
• Decision Tree Classifier
• Random Forest
• Ada Boost
• Support Vector Classifier
43. Classification Report
• A classification report is a summary of the performance metrics for a classification model.
• Precision: Precision is a measure of how many of the predicted positive instances were actually true positives.
• Precision = (True Positives) / (True Positives + False Positives)
• High precision indicates that the model makes fewer false positive errors.
• Recall (also known as Sensitivity or True Positive Rate): Recall measures the proportion of actual positive instances that were
correctly predicted by the model.
• Recall = (True Positives) / (True Positives + False Negatives)
• High recall indicates that the model captures a large portion of the positive instances.
• F1-Score: The F1-Score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall
and is particularly useful when you want to consider both false positives and false negatives.
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
• The F1-Score ranges between 0 and 1, where a higher value indicates a better balance between precision and recall.
• Support: Support represents the number of instances in each class in the test dataset. It gives you an idea of the distribution of
data across different classes.
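The formulas above can be checked by hand on a small example. The labels and predictions below are hypothetical; the metrics are computed for the positive class (1).

```python
import numpy as np

# Hypothetical true labels and model predictions.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # true positives
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # false positives
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = int((y_true == 1).sum())               # actual positives in the data

print(round(precision, 3), round(recall, 3), round(f1, 3), support)
```

scikit-learn's classification_report computes the same quantities for every class at once.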
44. AUCROC_CURVE
• AUCROC_curve
This code helps you visualize the model's ability to discriminate between the
positive and negative classes. The higher the AUC score, the better the model's performance.
Interpreting the AUC:
0.5 (Random Classifier): If the AUC is 0.5, it means that the model's performance is no better than random chance.
It's essentially saying that the model cannot distinguish between positive and negative cases effectively.
< 0.5 (Worse than Random): If the AUC is less than 0.5, it suggests that the model's performance is worse than
random chance. It is misclassifying cases in the opposite direction.
> 0.5 (Better than Random): If the AUC is greater than 0.5, it indicates that the model is performing better than
random chance. The higher the AUC, the better the model is at discriminating between the classes.
1.0 (Perfect Classifier): An AUC of 1.0 represents a perfect classifier. This means the model achieves perfect
discrimination, correctly classifying all positive cases while avoiding false positives.
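One way to read these values: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it by comparing all positive-negative pairs. The helper name auc_from_scores and the scores are hypothetical; in practice scikit-learn's roc_auc_score is the usual tool.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()  # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()    # tied pairs count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
perfect = auc_from_scores(y_true, [0.1, 0.2, 0.8, 0.9])        # all pairs correct
inverted = auc_from_scores(y_true, [0.9, 0.8, 0.2, 0.1])       # all pairs wrong
uninformative = auc_from_scores(y_true, [0.5, 0.5, 0.5, 0.5])  # all pairs tied
mixed = auc_from_scores(y_true, [0.1, 0.4, 0.35, 0.8])         # 3 of 4 pairs correct

print(perfect, inverted, uninformative, mixed)
```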
45. Logistic Regression – Modelling & Classification Report
• Logistic regression is a statistical
and machine learning model
used for binary classification,
which means it's used when the
target variable (the variable you
want to predict) has two possible
outcomes or classes.
• Classification Report
            Class 0   Class 1
Precision   0.79      0.83
Recall      0.86      0.74
F1 Score    0.83      0.78
47. K-Nearest Neighbour (KNN) – Modelling & Classification Report
• KNN operates based on the principle
that similar data points tend to have
similar labels or values.
• It's a non-parametric algorithm, which
means it doesn't make assumptions
about the underlying data distribution.
• KNN considers all available training
data when making predictions, which
can be advantageous in some cases but
might be computationally expensive for
large datasets.
• Classification Report
            Class 0   Class 1
Precision   0.94      0.83
Recall      0.83      0.94
F1 Score    0.88      0.88
49. Decision Tree – Modelling & Classification Report
• A Decision Tree is a popular
supervised ML algorithm used
for both classification and
regression tasks. It is a non-
parametric, non-linear model that
makes predictions by recursively
partitioning the dataset into
subsets based on the most
significant attribute(s) at each
node.
• Classification Report
            Class 0   Class 1
Precision   0.78      0.73
Recall      0.74      0.76
F1 Score    0.76      0.74
51. Random Forest– Modelling & Classification Report
• Random Forest is an ensemble
machine learning algorithm that is
widely used for both classification
and regression tasks. It is a
powerful and versatile algorithm
known for its high accuracy and
robustness. Random Forest builds
multiple decision trees during
training and combines their
predictions to produce more reliable
and generalizable results.
• Classification Report
            Class 0   Class 1
Precision   0.82      0.91
Recall      0.93      0.78
F1 Score    0.87      0.84
54. AdaBoost – Modelling & Classification Report
• AdaBoost, short for Adaptive
Boosting, is an ensemble learning
method used for classification and
regression tasks. It is particularly
effective in improving the
performance of weak learners
(models that perform slightly better
than random chance). The basic
idea behind AdaBoost is to combine
multiple weak learners to create a
strong classifier.
• Classification Report
            Class 0   Class 1
Precision   0.79      0.80
Recall      0.83      0.76
F1 Score    0.81      0.78
56. Support Vector Classifier – Modelling & Classification Report
• SVMs are adaptable and efficient
in a variety of applications
because they can handle high-
dimensional data and, through
kernels, nonlinear relationships.
• The SVM algorithm finds the
hyperplane that maximizes the
margin between the classes. With
a soft margin it can tolerate some
outliers instead of fitting them
exactly, which makes it relatively
robust to outliers.
• Classification Report
            Class 0   Class 1
Precision   0.85      0.90
Recall      0.92      0.82
F1 Score    0.89      0.85
58. COMPARING CLASSIFICATION REPORT & AUC SCORE OF VARIOUS MODELS
• Creating dictionary to compare Classification report and AUC Score of different models
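The comparison dictionary is not shown on the slide. One possible shape for it is sketched below, on a synthetic dataset from make_classification rather than the project data. The hyperparameters, metric choices, and variable names are all assumptions, so the printed scores will not match the reports above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split into train and test sets.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVC": SVC(random_state=0),
}

comparison = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Use class-1 probabilities when available, else the decision function.
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_test)[:, 1]
    else:
        y_score = model.decision_function(X_test)
    comparison[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
    }

best_by_auc = max(comparison, key=lambda n: comparison[n]["auc"])
for name, scores in comparison.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Collecting the metrics in one dictionary makes it easy to sort the models or turn the results into a DataFrame for a side-by-side table.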
60. Conclusion-
• In this employee churn prediction project, we started by examining a dataset with 1470
rows and 35 columns. It contained numerical and categorical variables, and we noticed an
imbalance in the Employee_Churn column.
• To address the data's characteristics, we performed data preprocessing.
• We separated the columns into categorical and numerical groups and used boxplots to find
outliers in the numerical columns.
• Visualization was done at three levels:
Univariate Analysis – Count plot & Pie Chart
Bivariate Analysis – Count plots & Hist plots
Multivariate Analysis – Scatter plot
• Later, we balanced the imbalanced data using SMOTE.
• Standardization was used to scale the features so that the models perform better.
• Principal Component Analysis, a dimensionality reduction technique, was used to
transform high-dimensional data into a new coordinate system, capturing the most important
information while minimizing information loss.
61. • We divided the dataset into training and testing sets and explored six different machine
learning models: Logistic Regression, K-Nearest Neighbour, Decision Tree Classifier, Random
Forest, AdaBoost and Support Vector Classifier.
• Our focus was on SVC, which showed the best performance in terms of accuracy,
AUC score and precision.
• The chosen Support Vector Classifier model achieved an accuracy of approximately 87.39%,
a precision score of 0.9047, and an AUC score of 0.9524. A model at this level can help the
company identify employees at risk of leaving and act to reduce churn.
• After applying PCA, the top two influential factors were examined for PC1 and PC3.
• In conclusion, by systematically preprocessing the data and selecting a suitable model, we
built a model that predicts employee churn with good accuracy. This helps the company make
more informed retention decisions and reduces the chances of financial setbacks.