NAME: POOJA SHAH
Date of Assignment: 18/11/23
Date of Submission: 11/12/23
Project 2
Title: EMPLOYEE CHURN PREDICTION
Project Aim
 To determine whether an employee will churn, as well as the loss incurred if they do churn.
 Create a system to help prevent such churn and support the sustainable operation of the company.
 This capstone project aims to uncover the factors that lead to employee attrition and explore important questions by developing an employee churn prediction system.
Overview of Project
Predicting employee churn involves using machine
learning models to forecast whether an employee is
likely to leave a company in the near future. This is
a crucial task for organizations as it allows them to
take preventive measures such as improving work
conditions, offering incentives, or providing career
development opportunities to retain valuable
employees.
Project Contents-
• Problem Formulation
• Data collection
• Importing libraries, loading and
understanding the data
• Exploratory Data Analysis
• Data Preprocessing
• Data Visualization
• Graphs Analysis
• Checking imbalance in dataset
• Balancing the data using SMOTE
• Feature Scaling
• Feature Extraction using PCA
• Model building & Evaluation
• Logistic Regression
• KNN
• Decision Tree Classifier
• Random Forest
• ADA Boost
• Support Vector Classifier
• Comparing different models
• Conclusion
Importing libraries, loading and
understanding the data-
• We will be using the following libraries
1) Pandas
2) Numpy
3) Seaborn
4) Matplotlib.pyplot
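A minimal setup sketch for this step, assuming the HR attrition data sits in a local CSV file (the file name below is hypothetical; the deck does not name the file):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the employee attrition dataset (hypothetical file name)
df = pd.read_csv("employee_attrition.csv")
df.head()
```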
Problem Formulation, Data Collection & Loading the Dataset
Exploratory Data Analysis
info() –
The info() method returns a concise summary of the data: the non-null count and dtype of each column.
Exploratory Data Analysis
• shape -
With the shape attribute we can get the overall number of rows and columns in the data.
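A short sketch of these two inspection calls, using the DataFrame name df as in the rest of the deck:

```python
# Column dtypes and non-null counts
df.info()

# (rows, columns) of the dataset
print(df.shape)
```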
Exploratory Data Analysis
 df.isnull() - creates a DataFrame of the
same shape as df, where each entry is True
if the corresponding element in df is NaN
(null), and False otherwise.
 .sum() then calculates the sum of True
values along each column, resulting in a
Series that contains the total number of
missing values for each column.
 .to_frame() converts the Series into a
DataFrame.
 .rename(columns={0:"Total No. of
Missing Values"}) renames the column
containing the total number of missing
values to "Total No. of Missing Values."
missing_data["% of Missing Values"] =
df.isnull().mean()*100:
df.isnull().mean() calculates the proportion
of missing values for each column by taking
the mean (average) of the Boolean values in
the DataFrame. This gives the percentage of
missing values for each column.
*100 is then used to convert the proportions
into percentages.
The result is assigned to a new column in
the missing_data DataFrame called "% of
Missing Values."
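Assembled from the fragments above, a runnable sketch of the missing-value summary (the name missing_data follows the deck's own naming):

```python
# Total missing values per column, as a one-column DataFrame
missing_data = df.isnull().sum().to_frame().rename(
    columns={0: "Total No. of Missing Values"}
)

# The same information expressed as a percentage of rows
missing_data["% of Missing Values"] = df.isnull().mean() * 100
print(missing_data)
```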
Exploratory Data Analysis
• df.duplicated()
 this method flags duplicate rows in the data
• df.duplicated().mean()*100
 It expresses the share of duplicate rows as a percentage
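A two-line sketch of this duplicate check:

```python
# Number of duplicate rows and their share of the dataset as a percentage
print(df.duplicated().sum())
print(df.duplicated().mean() * 100)
```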
Exploratory Data Analysis
• column_data_types = df.dtypes:
df.dtypes returns a Series containing the data
type of each column in the DataFrame.
 Counting numerical and categorical
columns:
 This loop iterates through each column in the
DataFrame and checks its data type.
• np.issubdtype(data_type, np.number)
 checks if the data type is a numerical type. If
true, it increments numerical_count; otherwise,
it increments categorical_count.
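A sketch of the counting loop described above; the variable names follow the slide text:

```python
column_data_types = df.dtypes

numerical_count = 0
categorical_count = 0
for data_type in column_data_types:
    # np.issubdtype is True for int/float dtypes, False for object columns
    if np.issubdtype(data_type, np.number):
        numerical_count += 1
    else:
        categorical_count += 1

print("Numerical columns:", numerical_count)
print("Categorical columns:", categorical_count)
```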
• describe().T –
• It generates descriptive statistics of the DataFrame's
numeric columns.
• .T  the transpose operation: it switches the rows and
columns of the result obtained from describe()
• Count: The number of non-null values in each
column.
• Mean: The average value of each column.
• Standard Deviation (std): It indicates how much individual
data points deviate from the mean.
• Minimum (min): The smallest value in each column.
• 25th Percentile (25%): Also known as the first quartile, it's
the value below which 25% of the data falls.
• Median (50%): Also known as the second quartile or the
median, it's the middle value when the data is sorted. It
represents the central tendency.
• 75th Percentile (75%): Also known as the third quartile,
it's the value below which 75% of the data falls.
• Maximum (max): The largest value in each column
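The corresponding call is a one-liner:

```python
# Transposed summary statistics: one row per numeric column, statistics as columns
df.describe().T
```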
Pre-Processing
• df.rename(columns={"Attrition":
"Employee_Churn"}, inplace=True)
 The provided code is using the
rename method in pandas to rename a
column in a DataFrame.
• df.drop(columns=["Over18",
"EmployeeCount",
"EmployeeNumber",
"StandardHours"], inplace=True)
After executing this code, the
specified columns ("Over18",
"EmployeeCount",
"EmployeeNumber", and
"StandardHours") will be removed
from your DataFrame (df).
• df.columns
 returns names of all columns
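The three preprocessing calls above, collected into one runnable sketch:

```python
# Rename the target column and drop constant / identifier columns
df.rename(columns={"Attrition": "Employee_Churn"}, inplace=True)
df.drop(columns=["Over18", "EmployeeCount", "EmployeeNumber", "StandardHours"],
        inplace=True)

print(df.columns)
```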
Pre-Processing
 We will see the names of categorical columns and numerical columns in the DataFrame
printed to the console. This information can be helpful for further analysis, preprocessing, or
visualization tasks that may require handling different types of data separately.
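One common way to obtain the two lists (a sketch; the deck may instead have reused the dtype loop shown earlier):

```python
# Split the column names by dtype
categorical_cols = df.select_dtypes(include="object").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()

print("Categorical columns:", categorical_cols)
print("Numerical columns:", numerical_cols)
```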
Pre-Processing
This code is a common approach for identifying and handling outliers in a dataset using the
IQR method, and it also provides visualizations to assess the impact of the outlier handling
process. It ensures that extreme outliers do not unduly affect the analysis of the data.
 The result is a grid of boxplots, where each subplot corresponds to a numerical column in the
DataFrame. This visualization is useful for understanding the distribution and variability of
values in each numerical feature.
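A sketch of the IQR-based handling and the boxplot grid, under the assumption that outliers are clipped to the 1.5×IQR fences (the deck does not show the exact code):

```python
# Cap each numerical column at the 1.5*IQR fences so extreme values
# cannot dominate the analysis
for col in numerical_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Grid of boxplots, one subplot per numerical column
n_cols = 4
n_rows = -(-len(numerical_cols) // n_cols)  # ceiling division
fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows))
for ax, col in zip(axes.ravel(), numerical_cols):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```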
VISUALISATION – UNIVARIATE ANALYSIS – count plot & Pie Chart sub plot
• The result is a figure containing a count plot and a pie chart, both illustrating employee
churn in terms of counts and percentages, respectively. The count plot shows the
distribution of churn and non-churn instances, while the pie chart provides a visual
representation of the churn rate as a percentage.
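A sketch of the count plot / pie chart pair for the churn column (labels and layout are illustrative):

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Count plot: absolute number of churned vs. retained employees
sns.countplot(data=df, x="Employee_Churn", ax=ax1)
ax1.set_title("Employee churn counts")

# Pie chart: churn rate as a percentage
churn_counts = df["Employee_Churn"].value_counts()
ax2.pie(churn_counts, labels=churn_counts.index.astype(str), autopct="%1.1f%%")
ax2.set_title("Employee churn percentage")

plt.show()
```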
VISUALISATION – BIVARIATE ANALYSIS – count plot
• Bivariate analysis is a
statistical analysis
technique that involves the
examination of the
relationship between two
variables. It is often used to
understand how one
variable affects or is related
to another variable.
• We then create count plots
for 2 categorical variables
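A sketch of such bivariate count plots; the two categorical columns below are illustrative choices, not taken from the deck:

```python
# Count plots of two categorical features, split by churn status
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, col in zip(axes, ["Department", "OverTime"]):  # hypothetical column picks
    sns.countplot(data=df, x=col, hue="Employee_Churn", ax=ax)
    ax.set_title(f"{col} vs. Employee_Churn")
plt.tight_layout()
plt.show()
```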
VISUALISATION – BIVARIATE ANALYSIS – Hist Plot
• The provided code defines a function named hist_plot that creates a histogram with a kernel
density estimate (KDE) for a specified column in a DataFrame (df).
• plt.show() is used to display all the created plots.
• Each histogram provides a visual representation of the distribution of the specified
numerical columns, and the bars are colored based on whether an employee has churned or
not (as indicated by the 'Employee_Churn' column). This allows for a quick comparison of
the distributions for employees who have churned versus those who haven't in terms of age,
monthly income, and years at the company.
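A sketch of the hist_plot helper as described, applied to the three columns named above (standard column names from the HR attrition dataset):

```python
def hist_plot(df, column):
    """Histogram with KDE for one numerical column, colored by churn status."""
    plt.figure(figsize=(8, 4))
    sns.histplot(data=df, x=column, hue="Employee_Churn", kde=True)
    plt.title(f"Distribution of {column} by Employee_Churn")

for col in ["Age", "MonthlyIncome", "YearsAtCompany"]:
    hist_plot(df, col)
plt.show()
```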
VISUALISATION – MULTIVARIATE ANALYSIS – scatter plot
• Scatter plots are used to visualize
the relationship between two
continuous variables.
• Each data point is plotted on a
graph, with one variable on the x-
axis and the other on the y-axis.
• This helps you visualize patterns,
trends, and potential correlations.
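An illustrative scatter-plot sketch; the axis columns are assumptions, chosen from the numerical features mentioned earlier:

```python
# Relationship between two continuous variables, colored by churn status
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="Age", y="MonthlyIncome", hue="Employee_Churn")
plt.title("Age vs. MonthlyIncome by Employee_Churn")
plt.show()
```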
REPLACE
• df['Employee_Churn']:
 This selects the 'Employee_Churn' column in the DataFrame df.
• .replace({'No': 0, 'Yes': 1}):
 This method replaces values in the specified column according to
the provided dictionary. In this case, it replaces 'No' with 0 and 'Yes'
with 1.
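Put together, the mapping is a single assignment:

```python
# Encode the target: 'No' -> 0, 'Yes' -> 1
df["Employee_Churn"] = df["Employee_Churn"].replace({"No": 0, "Yes": 1})
```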
LABEL ENCODER
• This code defines a function
named labelencoder that uses
scikit-learn's LabelEncoder to
encode categorical columns in a
pandas DataFrame into numerical
values.
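A sketch of such a helper, assuming it encodes every remaining object-typed column in place:

```python
from sklearn.preprocessing import LabelEncoder

def labelencoder(df):
    """Encode every categorical (object) column of df into integer codes."""
    le = LabelEncoder()
    for col in df.select_dtypes(include="object").columns:
        df[col] = le.fit_transform(df[col])
    return df

df = labelencoder(df)
```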
This code is a useful
way to visualize the
pairwise correlations
between features in
your dataset. It helps
identify relationships
between variables and
can be valuable for
feature selection and
understanding the
underlying structure of
your data.
FEATURE SELECTION
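A sketch of the pairwise-correlation heatmap described above; once the categorical columns are encoded, df.corr() covers every feature:

```python
# Pairwise correlations between all (now numeric) features
plt.figure(figsize=(16, 12))
sns.heatmap(df.corr(), cmap="coolwarm", annot=False)
plt.title("Feature correlation heatmap")
plt.show()
```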
Checking For Imbalance In Dataset
The code is creating a pie chart
to visually represent imbalanced
data, where the two slices
represent the "Churn" and "Not
Churn" classes with different
explosion and colors to highlight
the imbalance.
The percentages of each class
are displayed on the chart, and a
legend is added for clarity.
 SMOTE (Synthetic Minority
Over-sampling Technique) is
applied to the training data to
generate synthetic samples for
the minority class (the amount of
resampling is controlled by the
sampling_strategy parameter).
 This way, you can address class
imbalance in your dataset and
create a balanced training set for
your machine learning models.
 We split our data before using
SMOTE
Balancing The Data using SMOTE
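A sketch of the resampling step, including a quick bar plot of the rebalanced class distribution. The variable names x_sampled / y_sampled follow the later slides; the sampling_strategy value is an illustrative choice, and the deck notes that the split is performed before SMOTE is applied:

```python
from imblearn.over_sampling import SMOTE

# Features and encoded 0/1 target
x = df.drop(columns=["Employee_Churn"])
y = df["Employee_Churn"]

# Generate synthetic minority-class samples until the classes are balanced
# (sampling_strategy=1.0 means a 1:1 minority-to-majority ratio)
smote = SMOTE(sampling_strategy=1.0, random_state=42)
x_sampled, y_sampled = smote.fit_resample(x, y)

# Bar plot of the class distribution after resampling
pd.Series(y_sampled).value_counts().plot(kind="bar",
                                         title="Class balance after SMOTE")
plt.show()
```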
 The bar plot
provides a visual
representation of
the balanced or
adjusted
distribution of
classes in the
target variable
after SMOTE.
 Standardization, a common form of feature scaling (sometimes loosely called normalization), is a
preprocessing technique used in machine learning to bring all features to a similar scale.
 This process helps algorithms perform better by ensuring that no single feature dominates the
learning process due to its larger magnitude.
 Standardization is particularly important for algorithms that rely on distances or gradients,
such as k-nearest neighbors.
 The goal of standardization is to transform the features so that they have a mean of 0 and a
standard deviation of 1.
 This transformation does not change the shape of the distribution of the data; it simply scales
and shifts the data to make it more suitable for modeling.
The purpose of standardization is to transform the features so that they have a mean of 0 and a
standard deviation of 1. This is important, especially for algorithms that rely on distance
measures, as it ensures that all features contribute equally to the computations.
In this case, the features in x_sampled are standardized using the StandardScaler, and the result
is stored in the DataFrame standard_df. Each column in standard_df now represents a
standardized version of the corresponding feature in the original dataset.
FEATURE SCALING
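A sketch of the scaling step, following the x_sampled / standard_df names used above (assuming fit_resample returned a DataFrame, so the column names are preserved):

```python
from sklearn.preprocessing import StandardScaler

# Fit a scaler on the resampled features and transform them to mean 0, std 1
scaler = StandardScaler()
standard_df = pd.DataFrame(scaler.fit_transform(x_sampled),
                           columns=x_sampled.columns)

# Quick check: means should be ~0 and standard deviations ~1
print(standard_df.describe().T[["mean", "std"]].head())
```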
PCA stands for Principal Component Analysis. It is a dimensionality reduction technique
commonly used in machine learning and statistics.
The main goal of PCA is to transform high-dimensional data into a new coordinate system,
capturing the most important information while minimizing information loss.
PCA achieves this by finding a set of orthogonal axes (principal components) along which
the data varies the most.
PCA – PRINCIPAL COMPONENT ANALYSIS
PCA is then fitted on the standardized features (standard_df), and the resulting principal
components are used as the inputs for model building.
FEATURE EXTRACTION USING PCA
KEY STEPS IN PCA
Standardization: Standardize the features (subtract the mean and divide by the standard
deviation) to ensure that all features have a similar scale.
Covariance Matrix: Compute the covariance matrix for the standardized data. The covariance
matrix represents the relationships between pairs of features.
Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix. This
yields a set of eigenvalues and corresponding eigenvectors.
Principal Components: The eigenvectors represent the principal components. These are the
directions in feature space along which the data varies the most. The corresponding eigenvalues
indicate the amount of variance captured by each principal component.
Projection: Project the original data onto the new coordinate system defined by the principal
components. This results in a reduced-dimensional representation of the data.
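A sketch of the extraction step, fitting PCA on the standardized features; the number of components (enough to explain about 95% of the variance) is an illustrative choice:

```python
from sklearn.decomposition import PCA

# Keep enough components to explain ~95% of the variance (illustrative choice)
pca = PCA(n_components=0.95, random_state=42)
x_pca = pca.fit_transform(standard_df)

print("Components kept:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```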
TRAIN TEST SPLIT
 By splitting your data into training and testing sets, you can use X_train and y_train to train
your machine learning model and then use X_test to evaluate its performance.
 This is a common practice to assess how well your model generalizes to unseen data.
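A sketch of the split, assuming the PCA-transformed features (x_pca) and the resampled target (y_sampled); the 80/20 ratio is an illustrative assumption:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    x_pca, y_sampled, test_size=0.2, random_state=42, stratify=y_sampled
)
print(X_train.shape, X_test.shape)
```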
MODEL BUILDING, CLASSIFICATION
REPORT & EVALUATION
• We will now build the following models; a minimal fitting sketch follows the list
• Logistic Regression
• K-Nearest Neighbors
• Decision Tree Classifier
• Random Forest
• Ada Boost
• Support Vector Classifier
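A minimal fitting sketch for the six models listed above, using default hyperparameters (probability=True is set on SVC so ROC curves can be drawn later):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "SVC": SVC(probability=True, random_state=42),
}

# Fit every model on the training set
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: trained")
```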
Classification Report
• A classification report is a summary of the performance metrics for a classification model.
• Precision: Precision is a measure of how many of the predicted positive instances were actually true positives.
• Precision = (True Positives) / (True Positives + False Positives)
• High precision indicates that the model makes fewer false positive errors.
• Recall (also known as Sensitivity or True Positive Rate): Recall measures the proportion of actual positive instances that were
correctly predicted by the model.
• Recall = (True Positives) / (True Positives + False Negatives)
• High recall indicates that the model captures a large portion of the positive instances.
• F1-Score: The F1-Score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall
and is particularly useful when you want to consider both false positives and false negatives.
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
• The F1-Score ranges between 0 and 1, where a higher value indicates a better balance between precision and recall.
• Support: Support represents the number of instances in each class in the test dataset. It gives you an idea of the distribution of
data across different classes.
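Producing the report for any fitted model is one call; shown here for the logistic regression model from the sketch above:

```python
from sklearn.metrics import classification_report

y_pred = models["Logistic Regression"].predict(X_test)
print(classification_report(y_test, y_pred))
```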
AUCROC_CURVE
• AUCROC_curve
This code helps you visualize the performance of the model in terms of its ability to discriminate between the
positive and negative classes. The higher the AUC score, the better the model's performance.
Interpreting the AUC:
0.5 (Random Classifier): If the AUC is 0.5, it means that the model's performance is no better than random chance.
It's essentially saying that the model cannot distinguish between positive and negative cases effectively.
< 0.5 (Worse than Random): If the AUC is less than 0.5, it suggests that the model's performance is worse than
random chance. It is misclassifying cases in the opposite direction.
> 0.5 (Better than Random): If the AUC is greater than 0.5, it indicates that the model is performing better than
random chance. The higher the AUC, the better the model is at discriminating between the classes.
1.0 (Perfect Classifier): An AUC of 1.0 represents a perfect classifier. This means the model achieves perfect
discrimination, correctly classifying all positive cases while avoiding false positives.
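A sketch of the ROC curve and AUC score for one model; predict_proba supplies the positive-class probabilities the curve is built from:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive class (churn = 1) on the test set
y_prob = models["Logistic Regression"].predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_prob)
auc_score = roc_auc_score(y_test, y_prob)

plt.plot(fpr, tpr, label=f"AUC = {auc_score:.4f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```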
Logistic Regression – Modelling & Classification Report
• Logistic regression is a statistical
and machine learning model
used for binary classification,
which means it's used when the
target variable (the variable you
want to predict) has two possible
outcomes or classes.
• Classification Report
Class 0 Class 1
Precision 0.79 0.83
 Recall 0.86 0.74
 F1 Score 0.83 0.78
AUCROC_CURVE - Evaluation
• AUCROC_curve
AUC Score – 0.8964
K-Nearest Neighbour (KNN)– Modelling &
Classification Report
• KNN operates based on the principle
that similar data points tend to have
similar labels or values.
• It's a non-parametric algorithm, which
means it doesn't make assumptions
about the underlying data distribution.
• KNN considers all available training
data when making predictions, which
can be advantageous in some cases but
might be computationally expensive for
large datasets.
• Classification Report
Class 0 Class 1
Precision 0.94 0.83
 Recall 0.83 0.94
 F1 Score 0.88 0.88
AUCROC_CURVE - Evaluation
• AUCROC_curve
AUC Score – 0.9325
Decision Tree – Modelling & Classification Report
• A Decision Tree is a popular
supervised ML algorithm used
for both classification and
regression tasks. It is a non-
parametric, non-linear model that
makes predictions by recursively
partitioning the dataset into
subsets based on the most
significant attribute(s) at each
node.
• Classification Report
Class 0 Class 1
Precision 0.78 0.73
 Recall 0.74 0.76
 F1 Score 0.76 0.74
AUCROC_CURVE - Evaluation
• AUCROC_curve
AUC Score – 0.7527
Random Forest– Modelling & Classification Report
• Random Forest is an ensemble
machine learning algorithm that is
widely used for both classification
and regression tasks. It is a
powerful and versatile algorithm
known for its high accuracy and
robustness. Random Forest builds
multiple decision trees during
training and combines their
predictions to produce more reliable
and generalizable results.
• Classification Report
Class 0 Class 1
Precision 0.82 0.91
 Recall 0.93 0.78
 F1 Score 0.87 0.84
AUCROC_CURVE - Evaluation
• AUCROC_curve
AUC Score – 0.9374
Feature importance
AdaBoost – Modelling & Classification Report
• AdaBoost, short for Adaptive
Boosting, is an ensemble learning
method used for classification and
regression tasks. It is particularly
effective in improving the
performance of weak learners
(models that perform slightly better
than random chance). The basic
idea behind AdaBoost is to combine
multiple weak learners to create a
strong classifier.
• Classification Report
Class 0 Class 1
Precision 0.79 0.80
 Recall 0.83 0.76
 F1 Score 0.81 0.78
AUCROC_CURVE - Evaluation
• AUCROC_curve
AUC Score – 0.8904
Support Vector Classifier– Modelling & Classification
Report
• SVMs are adaptable and efficient
in a variety of applications
because they can manage high-
dimensional data and nonlinear
relationships.
• The SVM algorithm has the
characteristics to ignore the
outlier and finds the best
hyperplane that maximizes the
margin. SVM is robust to
outliers.
• Classification Report
Class 0 Class 1
Precision 0.85 0.90
 Recall 0.92 0.82
 F1 Score 0.89 0.85
AUCROC_CURVE - Evaluation
• AUCROC_curve
AUC Score – 0.9524
COMPARING CLASSIFICATION REPORT & AUC SCORE OF VARIOUS MODELS
• Creating a dictionary to compare the classification report and AUC score of the different models; a sketch follows.
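A sketch of such a comparison table, built from the fitted models above (accuracy, precision, and AUC per model):

```python
from sklearn.metrics import accuracy_score, precision_score

comparison = {}
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    comparison[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
    }

print(pd.DataFrame(comparison).T.round(4))
```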
COMPARING CLASSIFICATION REPORT & AUC SCORE OF VARIOUS MODELS
Conclusion-
•In this Employee Churn prediction process, we started by examining a dataset with 1470
rows and 35 columns. It contained numerical and categorical variables, and we noticed an
imbalance in the Employee_Churn column.
•To address the data's characteristics, we performed data preprocessing.
• We split the data into categorical and numerical columns and used boxplots to find outliers.
• Visualization was done using 3 types
 Univariate Analysis – Count plot & Pie Chart
 Bivariate Analysis – Count plots & Hist plots
 Multivariate Analysis – Scatter diagram
•We then balanced the imbalanced data using SMOTE.
•Standardization was used to scale the features for better model performance.
• Principal Component Analysis, a dimensionality reduction technique, was used to
transform high-dimensional data into a new coordinate system, capturing the most important
information while minimizing information loss.
• We divided the dataset into training and testing sets and explored Six different machine
learning models: Logistic Regression, K-Nearest Neighbour, Decision Tree Classifier, Random
forest, AdaBoost and Support Vector Classifier.
• However, our focus was on SVC, which showed the best performance in terms of accuracy,
AUC score and precision.
• The chosen Support Vector Classifier model achieved an accuracy of approximately 87.387%,
a precision score of 0.9047, and an AUC score of 0.9524. This result helps the company
minimize potential employee churn.
• After applying PCA, the top two influential factors were identified for PC1 and PC3.
• In conclusion, by systematically preprocessing the data and selecting the right model, we
successfully built a model that enhances prediction accuracy and lowers the risk of employee
churn. This helps the company make more informed retention decisions and reduces the
chances of financial setbacks.
Conclusion-
Thank you!!!

Weitere ähnliche Inhalte

Ähnlich wie Predicting Employee Churn: A Data-Driven Approach Project Presentation

Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPiyush Srivastava
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 
Dimensionality Reduction.pptx
Dimensionality Reduction.pptxDimensionality Reduction.pptx
Dimensionality Reduction.pptxPriyadharshiniG41
 
Excel Datamining Addin Beginner
Excel Datamining Addin BeginnerExcel Datamining Addin Beginner
Excel Datamining Addin Beginnerexcel content
 
Principal component analysis and lda
Principal component analysis and ldaPrincipal component analysis and lda
Principal component analysis and ldaSuresh Pokharel
 
Data Reduction
Data ReductionData Reduction
Data ReductionRajan Shah
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMSAli T. Lotia
 
Software quality management question bank
Software quality management question bankSoftware quality management question bank
Software quality management question bankselinasimpson3001
 
Lead Scoring Group Case Study Presentation.pdf
Lead Scoring Group Case Study Presentation.pdfLead Scoring Group Case Study Presentation.pdf
Lead Scoring Group Case Study Presentation.pdfKrishP2
 
B409 W11 Sas Collaborative Stats Guide V4.2
B409 W11 Sas Collaborative Stats Guide V4.2B409 W11 Sas Collaborative Stats Guide V4.2
B409 W11 Sas Collaborative Stats Guide V4.2marshalkalra
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningVenkata Karthik Gullapalli
 
Quality management presentation
Quality management presentationQuality management presentation
Quality management presentationselinasimpson1501
 
BSA_AML Rule Tuning
BSA_AML Rule TuningBSA_AML Rule Tuning
BSA_AML Rule TuningMayank Johri
 
Approach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsApproach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsMayank Johri
 

Ähnlich wie Predicting Employee Churn: A Data-Driven Approach Project Presentation (20)

Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
 
1234
12341234
1234
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Dimensionality Reduction.pptx
Dimensionality Reduction.pptxDimensionality Reduction.pptx
Dimensionality Reduction.pptx
 
Excel Datamining Addin Beginner
Excel Datamining Addin BeginnerExcel Datamining Addin Beginner
Excel Datamining Addin Beginner
 
Excel Datamining Addin Beginner
Excel Datamining Addin BeginnerExcel Datamining Addin Beginner
Excel Datamining Addin Beginner
 
Principal component analysis and lda
Principal component analysis and ldaPrincipal component analysis and lda
Principal component analysis and lda
 
Data Reduction
Data ReductionData Reduction
Data Reduction
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
 
Software quality management question bank
Software quality management question bankSoftware quality management question bank
Software quality management question bank
 
Lead Scoring Group Case Study Presentation.pdf
Lead Scoring Group Case Study Presentation.pdfLead Scoring Group Case Study Presentation.pdf
Lead Scoring Group Case Study Presentation.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
B409 W11 Sas Collaborative Stats Guide V4.2
B409 W11 Sas Collaborative Stats Guide V4.2B409 W11 Sas Collaborative Stats Guide V4.2
B409 W11 Sas Collaborative Stats Guide V4.2
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
 
Quality management presentation
Quality management presentationQuality management presentation
Quality management presentation
 
working with python
working with pythonworking with python
working with python
 
BSA_AML Rule Tuning
BSA_AML Rule TuningBSA_AML Rule Tuning
BSA_AML Rule Tuning
 
Approach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsApproach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule Thresholds
 

Mehr von Boston Institute of Analytics

Enhancing Cybersecurity: An In-depth Analysis of Travelblog.org
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.orgEnhancing Cybersecurity: An In-depth Analysis of Travelblog.org
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.orgBoston Institute of Analytics
 
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRF
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRFExploring Web Security Threats: A Practical Study on SQL Injection and CSRF
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRFBoston Institute of Analytics
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...Boston Institute of Analytics
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
NLP Based project presentation: Analyzing Automobile Prices
NLP Based project presentation: Analyzing Automobile PricesNLP Based project presentation: Analyzing Automobile Prices
NLP Based project presentation: Analyzing Automobile PricesBoston Institute of Analytics
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
Combating Fraudulent Transactions: A Deep Dive into Credit Card Fraud Detection
Combating Fraudulent Transactions: A Deep Dive into Credit Card Fraud DetectionCombating Fraudulent Transactions: A Deep Dive into Credit Card Fraud Detection
Combating Fraudulent Transactions: A Deep Dive into Credit Card Fraud DetectionBoston Institute of Analytics
 
Predicting Liver Disease in India: A Machine Learning Approach
Predicting Liver Disease in India: A Machine Learning ApproachPredicting Liver Disease in India: A Machine Learning Approach
Predicting Liver Disease in India: A Machine Learning ApproachBoston Institute of Analytics
 
Employee Churn Prediction: Artificial Intelligence Project Presentation
Employee Churn Prediction: Artificial Intelligence Project PresentationEmployee Churn Prediction: Artificial Intelligence Project Presentation
Employee Churn Prediction: Artificial Intelligence Project PresentationBoston Institute of Analytics
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 

Mehr von Boston Institute of Analytics (20)

Enhancing Cybersecurity: An In-depth Analysis of Travelblog.org
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.orgEnhancing Cybersecurity: An In-depth Analysis of Travelblog.org
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.org
 
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRF
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRFExploring Web Security Threats: A Practical Study on SQL Injection and CSRF
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRF
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Detecting Credit Card Fraud: An AI-driven Approach
Detecting Credit Card Fraud: An AI-driven ApproachDetecting Credit Card Fraud: An AI-driven Approach
Detecting Credit Card Fraud: An AI-driven Approach
 
Predicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning ApproachPredicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning Approach
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
NLP Based project presentation: Analyzing Automobile Prices
NLP Based project presentation: Analyzing Automobile PricesNLP Based project presentation: Analyzing Automobile Prices
NLP Based project presentation: Analyzing Automobile Prices
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Analyzing Movie Reviews : Machine learning project
Analyzing Movie Reviews : Machine learning projectAnalyzing Movie Reviews : Machine learning project
Analyzing Movie Reviews : Machine learning project
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
Combating Fraudulent Transactions: A Deep Dive into Credit Card Fraud Detection
Combating Fraudulent Transactions: A Deep Dive into Credit Card Fraud DetectionCombating Fraudulent Transactions: A Deep Dive into Credit Card Fraud Detection
Combating Fraudulent Transactions: A Deep Dive into Credit Card Fraud Detection
 
Predicting Liver Disease in India: A Machine Learning Approach
Predicting Liver Disease in India: A Machine Learning ApproachPredicting Liver Disease in India: A Machine Learning Approach
Predicting Liver Disease in India: A Machine Learning Approach
 
Employee Churn Prediction: Artificial Intelligence Project Presentation
Employee Churn Prediction: Artificial Intelligence Project PresentationEmployee Churn Prediction: Artificial Intelligence Project Presentation
Employee Churn Prediction: Artificial Intelligence Project Presentation
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 

Kürzlich hochgeladen

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 

Kürzlich hochgeladen (20)

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 

Predicting Employee Churn: A Data-Driven Approach Project Presentation

  • 1. NAME: POOJA SHAH Date of Assignment: 18/11/23 Date of Submission: 11/12/23 Project 2 Title: EMPLOYEE CHURN PREDICTION
  • 2. Project Aim  To determine whether an employee will churn or not , as well as the loss incurred if it does churn.  Create a system to prevent such churn for peaceful sustainability of our company.  This capstone project aims to uncover the factors that lead to employee attrition and explore important questions by developing an employee churn prediction system
  • 3. Overview of Project Predicting employee churn involves using machine learning models to forecast whether an employee is likely to leave a company in the near future. This is a crucial task for organizations as it allows them to take preventive measures such as improving work conditions, offering incentives, or providing career development opportunities to retain valuable employees.
  • 4. Project Contents- • Problem Formulation • Data collection • Importing libraries, loading and understanding the data • Exploratory Data Analysis • Data Preprocessing • Data Visualization • Graphs Analysis • Checking imbalance in dataset • Balancing the data using SMOTE • Feature Scaling • Feature Extraction using PCA • Model building & Evaluation • Logistic Regression • KNN • Decision Tree Classifier • Random Forest • ADA Boost • Support Vector Classifier • Comparing different models • Conclusion
  • 5. Importing libraries, loading and understanding the data- • We will be using the following libraries 1) Pandas 2) Numpy 3) Seaborn 4) Matplotlib.pyplot
  • 6. Problem Formulation, Data Collection & Loading the Dataset
  • 7. Exploratory Data Analysis info () – The info method returns the information non- null count and dtype of the data.
  • 8. Exploratory Data Analysis • Shape () - With the help of shape attribute we can get to know overall rows and columns in the data.
  • 9. Exploratory Data Analysis  df.isnull() - creates a DataFrame of the same shape as df, where each entry is True if the corresponding element in df is NaN (null), and False otherwise.  .sum() then calculates the sum of True values along each column, resulting in a Series that contains the total number of missing values for each column.  .to_frame() converts the Series into a DataFrame.  .rename(columns={0:"Total No. of Missing Values"}) renames the column containing the total number of missing values to "Total No. of Missing Values." missing_data["% of Missing Values"] = df.isnull().mean()*100: df.isnull().mean() calculates the proportion of missing values for each column by taking the mean (average) of the Boolean values in the DataFrame. This gives the percentage of missing values for each column. *100 is then used to convert the proportions into percentages. The result is assigned to a new column in the missing_data DataFrame called "% of Missing Values."
  • 11. Exploratory Data Analysis • df.duplicated()  this method finds duplicate rows in data • df.duplicated().mean()*100 It converts duplicate values into percentage
  • 12. Exploratory Data Analysis • column_data_types = df.dtypes: df.dtypes returns a Series containing the data type of each column in the DataFrame.  Counting numerical and categorical columns:  This loop iterates through each column in the DataFrame and checks its data type. • np.issubdtype(data_type, np.number)  checks if the data type is a numerical type. If true, it increments numerical_count; otherwise, it increments categorical_count.
  • 13. • describe().T – • It generates descriptive statistics of the DataFrame's numeric columns. • .T  It is transpose operation. It switches the rows and columns of the result obtained from describe() • Getting the Count: The number of non-null values in each column. • Mean: The average value of each column. • Standard Deviation (std): It indicates how much individual data points deviate from the mean. • Minimum (min): The smallest value in each column. • 25th Percentile (25%): Also known as the first quartile, it's the value below which 25% of the data falls. • Median (50%): Also known as the second quartile or the median, it's the middle value when the data is sorted. It represents the central tendency. • 75th Percentile (75%): Also known as the third quartile, it's the value below which 75% of the data falls. • Maximum (max): The largest value in each column
  • 14. Pre-Processing • df.rename(columns={"Attrition": "Employee_Churn"}, inplace=True)  The provided code is using the rename method in pandas to rename a column in a DataFrame. • df.drop(columns=["Over18", "EmployeeCount", "EmployeeNumber", "StandardHours"], inplace=True) After executing this code, the specified columns ("Over18", "EmployeeCount", "EmployeeNumber", and "StandardHours") will be removed from your DataFrame (df). • df.columns  returns names of all columns
  • 15. Pre-Processing  We will see the names of categorical columns and numerical columns in the DataFrame printed to the console. This information can be helpful for further analysis, preprocessing, or visualization tasks that may require handling different types of data separately.
  • 16. Pre-Processing This code is a common approach for identifying and handling outliers in a dataset using the IQR method, and it also provides visualizations to assess the impact of the outlier handling process. It ensures that extreme outliers do not unduly affect the analysis of the data.  The result is a grid of boxplots, where each subplot corresponds to a numerical column in the DataFrame. This visualization is useful for understanding the distribution and variability of values in each numerical feature.
  • 17.
  • 18.
  • 19.
  • 20. VISUALISATION – UNIVARIATE ANALYSIS – count plot & Pie Chart sub plot • The result is a figure containing a count plot and a pie chart, both illustrating employee churn in terms of counts and percentages, respectively. The count plot shows the distribution of churn and non-churn instances, while the pie chart provides a visual representation of the churn rate as a percentage.
  • 21.
  • 22. VISUALISATION – BIVARIATE ANALYSIS – count plot • Bivariate analysis is a statistical analysis technique that involves the examination of the relationship between two variables. It is often used to understand how one variable affects or is related to another variable. • We then create count plots for 2 categorical variables
  • 23.
  • 24.
  • 25.
  • 26. VISUALISATION – BIVARIATE ANALYSIS – Hist Plot • The provided code defines a function named hist_plot that creates a histogram with a kernel density estimate (KDE) for a specified column in a DataFrame (df). • plt.show() is used to display all the created plots. • Each histogram provides a visual representation of the distribution of the specified numerical columns, and the bars are colored based on whether an employee has churned or not (as indicated by the 'Employee_Churn' column). This allows for a quick comparison of the distributions for employees who have churned versus those who haven't in terms of age, monthly income, and years at the company.
  • 27.
  • 28. VISUALISATION – MULTIVARIATE ANALYSIS – scatter plot • Scatter plots are used to visualize the relationship between two continuous variables. • Each data point is plotted on a graph, with one variable on the x- axis and the other on the y-axis. • This helps you visualize patterns, trends, and potential correlation
  • 29. REPLACE • df['Employee_Churn’]:  This selects the 'Employee_Churn' column in the DataFrame df. • .replace({'No': 0, 'Yes': 1}):  This method replaces values in the specified column according to the provided dictionary. In this case, it replaces 'No' with 0 and 'Yes' with 1.
  • 30. LABEL ENCODER • This code defines a function named labelencoder that uses scikit-learn's LabelEncoder to encode categorical columns in a pandas DataFrame into numerical values.
  • 31. This code is a useful way to visualize the pairwise correlations between features in your dataset. It helps identify relationships between variables and can be valuable for feature selection and understanding the underlying structure of your data. FEATURE SELECTION
  • 32. Checking For Imbalance In Dataset The code is creating a pie chart to visually represent imbalanced data, where the two slices represent the “Churn" and “Not Churn" classes with different explosion and colors to highlight the imbalance. The percentages of each class are displayed on the chart, and a legend is added for clarity.
  • 33.  SMOTE (Synthetic Minority Over-sampling Technique),is applied to the training data to generate synthetic samples for the minority class (where the class with a minority of examples is specified by the sampling_strategy parameter).  This way, you can address class imbalance in your dataset and create a balanced training set for your machine learning models.  We split our data before using SMOTE Balancing The Data using SMOTE
  • 34.  The bar plot provides a visual representation of the balanced or adjusted distribution of classes in the target variable after SMOTE.
  • 35.  Standardization, also known as feature scaling or normalization, is a preprocessing technique commonly used in machine learning to bring all features or variables to a similar scale.  This process helps algorithms perform better by ensuring that no single feature dominates the learning process due to its larger magnitude.  Standardization is particularly important for algorithms that rely on distances or gradients, such as k-nearest neighbors  The goal of standardization is to transform the features so that they have a mean of 0 and a standard deviation of 1.  This transformation does not change the shape of the distribution of the data; it simply scales and shifts the data to make it more suitable for modeling.
  • 36. The purpose of standardization is to transform the features so that they have a mean of 0 and a standard deviation of 1. This is important, especially for algorithms that rely on distance measures, as it ensures that all features contribute equally to the computations. In this case, the features in x_sampled are standardized using the StandardScaler, and the result is stored in the DataFrame standard_df. Each column in standard_df now represents a standardized version of the corresponding feature in the original dataset. FEATURE SCALING
  • 37. PCA stands for Principal Component Analysis. It is a dimensionality reduction technique commonly used in machine learning and statistics. The main goal of PCA is to transform high-dimensional data into a new coordinate system, capturing the most important information while minimizing information loss. PCA achieves this by finding a set of orthogonal axes (principal components) along which the data varies the most. PCA – PRINCIPAL COMPONENT ANALYSIS
  • 38. The purpose of standardization is to transform the features so that they have a mean of 0 and a standard deviation of 1. This is important, especially for algorithms that rely on distance measures, as it ensures that all features contribute equally to the computations. In this case, the features in x_sampled are standardized using the StandardScaler, and the result is stored in the DataFrame standard_df. Each column in standard_df now represents a standardized version of the corresponding feature in the original dataset. FEATURE EXTRACTION USING PCA
  • 39. KEY STEPS IN PCA Standardization: Standardize the features (subtract the mean and divide by the standard deviation) to ensure that all features have a similar scale. Covariance Matrix: Compute the covariance matrix for the standardized data. The covariance matrix represents the relationships between pairs of features. Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix. This yields a set of eigenvalues and corresponding eigenvectors. Principal Components: The eigenvectors represent the principal components. These are the directions in feature space along which the data varies the most. The corresponding eigenvalues indicate the amount of variance captured by each principal component. Projection: Project the original data onto the new coordinate system defined by the principal components. This results in a reduced-dimensional representation of the data.
  • 40.
  • 41. TRAIN TEST SPLIT  By splitting your data into training and testing sets, you can use X_train and y_train to train your machine learning model and then use X_test to evaluate its performance.  This is a common practice to assess how well your model generalizes to unseen data.
  • 42. MODEL BUILDING, CLASSIFICATION REPORT & EVALUATION • Will now build the following models • Logistic Regression • K-Nearest Neighbors • Decision Tree Classifier • Random Forest • Ada Boost • Support Vector Classifier
  • 43. Classification Report • A classification report is a summary of the performance metrics for a classification model. • Precision: Precision is a measure of how many of the predicted positive instances were actually true positives. • Precision = (True Positives) / (True Positives + False Positives) • High precision indicates that the model makes fewer false positive errors. • Recall (also known as Sensitivity or True Positive Rate): Recall measures the proportion of actual positive instances that were correctly predicted by the model. • Recall = (True Positives) / (True Positives + False Negatives) • High recall indicates that the model captures a large portion of the positive instances. • F1-Score: The F1-Score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall and is particularly useful when you want to consider both false positives and false negatives. • F1-Score = 2 * (Precision * Recall) / (Precision + Recall) • The F1-Score ranges between 0 and 1, where a higher value indicates a better balance between precision and recall. • Support: Support represents the number of instances in each class in the test dataset. It gives you an idea of the distribution of data across different classes.
• 44. AUC-ROC CURVE • This code helps you visualize the model's performance in terms of its ability to discriminate between the positive and negative classes; the higher the AUC score, the better the model's performance. Interpreting the AUC:
 0.5 (Random Classifier): The model's performance is no better than random chance; it cannot distinguish between positive and negative cases effectively.
 < 0.5 (Worse than Random): The model's performance is worse than random chance; it is misclassifying cases in the opposite direction.
 > 0.5 (Better than Random): The model is performing better than random chance. The higher the AUC, the better the model is at discriminating between the classes.
 1.0 (Perfect Classifier): The model achieves perfect discrimination, correctly classifying all positive cases while avoiding false positives.
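The plotting code shown on the slide is not reproduced here; below is a minimal sketch of how such an evaluation is commonly done with scikit-learn and matplotlib. The function name plot_roc is an assumption, and each fitted model is assumed to expose predict_proba (for SVC this requires probability=True).

```python
# Sketch of the AUC-ROC evaluation used for each model below.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(model, X_test, y_test, label):
    y_score = model.predict_proba(X_test)[:, 1]   # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc = roc_auc_score(y_test, y_score)
    plt.plot(fpr, tpr, label=f"{label} (AUC = {auc:.4f})")
    plt.plot([0, 1], [0, 1], "k--")               # random-classifier baseline
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()
    return auc
```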
• 45. Logistic Regression – Modelling & Classification Report • Logistic regression is a statistical and machine learning model used for binary classification, which means it's used when the target variable (the variable you want to predict) has two possible outcomes or classes. • Classification Report (Class 0 / Class 1):  Precision 0.79 / 0.83  Recall 0.86 / 0.74  F1 Score 0.83 / 0.78
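A minimal sketch of this step, using the hypothetical helpers defined earlier; max_iter=1000 is an assumed setting, not one stated in the slides.

```python
from sklearn.linear_model import LogisticRegression

log_reg = report_model(LogisticRegression(max_iter=1000),
                       X_train, X_test, y_train, y_test)
plot_roc(log_reg, X_test, y_test, "Logistic Regression")
```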
• 46. AUC-ROC Curve – Evaluation • AUC Score – 0.8964
• 47. K-Nearest Neighbour (KNN) – Modelling & Classification Report • KNN operates based on the principle that similar data points tend to have similar labels or values. • It's a non-parametric algorithm, which means it doesn't make assumptions about the underlying data distribution. • KNN considers all available training data when making predictions, which can be advantageous in some cases but might be computationally expensive for large datasets. • Classification Report (Class 0 / Class 1):  Precision 0.94 / 0.83  Recall 0.83 / 0.94  F1 Score 0.88 / 0.88
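Sketch using the same helpers; n_neighbors=5 is scikit-learn's default, not a value reported in the slides.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = report_model(KNeighborsClassifier(n_neighbors=5),
                   X_train, X_test, y_train, y_test)
plot_roc(knn, X_test, y_test, "KNN")
```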
• 48. AUC-ROC Curve – Evaluation • AUC Score – 0.9325
• 49. Decision Tree – Modelling & Classification Report • A Decision Tree is a popular supervised ML algorithm used for both classification and regression tasks. It is a non-parametric, non-linear model that makes predictions by recursively partitioning the dataset into subsets based on the most significant attribute(s) at each node. • Classification Report (Class 0 / Class 1):  Precision 0.78 / 0.73  Recall 0.74 / 0.76  F1 Score 0.76 / 0.74
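Sketch using the same helpers; max_depth is an assumed regularization choice and may differ from the original model.

```python
from sklearn.tree import DecisionTreeClassifier

tree = report_model(DecisionTreeClassifier(max_depth=6, random_state=42),
                    X_train, X_test, y_train, y_test)
plot_roc(tree, X_test, y_test, "Decision Tree")
```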
• 50. AUC-ROC Curve – Evaluation • AUC Score – 0.7527
• 51. Random Forest – Modelling & Classification Report • Random Forest is an ensemble machine learning algorithm that is widely used for both classification and regression tasks. It is a powerful and versatile algorithm known for its high accuracy and robustness. Random Forest builds multiple decision trees during training and combines their predictions to produce more reliable and generalizable results. • Classification Report (Class 0 / Class 1):  Precision 0.82 / 0.91  Recall 0.93 / 0.78  F1 Score 0.87 / 0.84
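Sketch using the same helpers; n_estimators=100 is the library default rather than a value taken from the slides.

```python
from sklearn.ensemble import RandomForestClassifier

rf = report_model(RandomForestClassifier(n_estimators=100, random_state=42),
                  X_train, X_test, y_train, y_test)
plot_roc(rf, X_test, y_test, "Random Forest")
```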
• 52. AUC-ROC Curve – Evaluation • AUC Score – 0.9374
• 54. AdaBoost – Modelling & Classification Report • AdaBoost, short for Adaptive Boosting, is an ensemble learning method used for classification and regression tasks. It is particularly effective in improving the performance of weak learners (models that perform slightly better than random chance). The basic idea behind AdaBoost is to combine multiple weak learners to create a strong classifier. • Classification Report (Class 0 / Class 1):  Precision 0.79 / 0.80  Recall 0.83 / 0.76  F1 Score 0.81 / 0.78
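Sketch using the same helpers; by default scikit-learn's AdaBoost uses a depth-1 decision tree as the weak learner, and n_estimators=100 is an assumption.

```python
from sklearn.ensemble import AdaBoostClassifier

ada = report_model(AdaBoostClassifier(n_estimators=100, random_state=42),
                   X_train, X_test, y_train, y_test)
plot_roc(ada, X_test, y_test, "AdaBoost")
```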
• 55. AUC-ROC Curve – Evaluation • AUC Score – 0.8904
• 56. Support Vector Classifier – Modelling & Classification Report • SVMs are adaptable and efficient in a variety of applications because they can manage high-dimensional data and nonlinear relationships. • The SVM algorithm can ignore outliers while finding the hyperplane that maximizes the margin, which makes it robust to outliers. • Classification Report (Class 0 / Class 1):  Precision 0.85 / 0.90  Recall 0.92 / 0.82  F1 Score 0.89 / 0.85
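Sketch using the same helpers; the RBF kernel is the scikit-learn default, and probability=True is needed so plot_roc can call predict_proba.

```python
from sklearn.svm import SVC

svc = report_model(SVC(kernel="rbf", probability=True, random_state=42),
                   X_train, X_test, y_train, y_test)
plot_roc(svc, X_test, y_test, "Support Vector Classifier")
```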
• 57. AUC-ROC Curve – Evaluation • AUC Score – 0.9524
• 58. COMPARING CLASSIFICATION REPORT & AUC SCORE OF VARIOUS MODELS • Creating a dictionary to compare the classification reports and AUC scores of the different models, as sketched below.
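The dictionary built in the original notebook is not included in this export; the sketch below compares the models by the AUC scores reported on the earlier slides (the structure and variable names are illustrative).

```python
# Illustrative comparison using the AUC scores quoted on the previous slides.
import pandas as pd

auc_scores = {
    "Logistic Regression": 0.8964,
    "KNN": 0.9325,
    "Decision Tree": 0.7527,
    "Random Forest": 0.9374,
    "AdaBoost": 0.8904,
    "Support Vector Classifier": 0.9524,
}
print(pd.Series(auc_scores, name="AUC Score").sort_values(ascending=False))
```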
  • 59. COMPARING CLASSIFICATION REPORT & AUC SCORE OF VARIOUS MODELS
• 60. Conclusion – • In this employee churn prediction process, we started by examining a dataset with 1470 rows and 35 columns. It contained numerical and categorical variables, and we noticed an imbalance in the employee churn column. • To address the data's characteristics, we performed data preprocessing. • We split the data into categorical and numerical features and looked for outliers using box plots. • Visualization was done in three ways:  Univariate Analysis – Count plots & Pie charts  Bivariate Analysis – Count plots & Hist plots  Multivariate Analysis – Scatter diagrams • Later we balanced the imbalanced data using SMOTE. • Standardization was used to scale the features for better model performance. • Principal Component Analysis, a dimensionality reduction technique, was used to transform the high-dimensional data into a new coordinate system, capturing the most important information while minimizing information loss.
• 61. Conclusion (continued) • We divided the dataset into training and testing sets and explored six different machine learning models: Logistic Regression, K-Nearest Neighbour, Decision Tree Classifier, Random Forest, AdaBoost and Support Vector Classifier. • Our focus settled on SVC, which showed the best performance in terms of accuracy, AUC score and precision. • The chosen Support Vector Classifier model achieved an accuracy of approximately 87.387%, a precision score of 0.9047, and an AUC score of 0.9524. This helps the company minimize potential employee churn. • After applying PCA, the top influential factors for PC1 and PC3 were identified for the prediction task. • In conclusion, by systematically preprocessing the data and selecting the right model, we successfully built a model that improves prediction accuracy and lowers the risk of unaddressed employee churn. This helps the company make more informed retention decisions and reduces the chances of financial setbacks.