This project presents a complete machine learning pipeline for handling imbalanced datasets, specifically focusing on customer churn prediction. It utilizes DVC (Data Version Control) for data and model versioning and MLflow for experiment tracking. To address class imbalance, the pipeline integrates ADASYN (Adaptive Synthetic Sampling), improving model performance by generating synthetic samples for the minority class.
The workflow follows a structured approach, including data preprocessing, model training, evaluation, and experiment tracking, ensuring reproducibility and scalability.
- **Data Exploration & EDA:**
  - The `EDA.ipynb` notebook provides a comprehensive exploratory data analysis (EDA) of the dataset.
  - It examines feature distributions, trends, and their influence on churn prediction.
  - At the end of this stage, decisions regarding feature selection, removal, and encoding are documented.
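The starting point of the EDA is usually a look at the class balance and missing values, since those findings drive the resampling and encoding decisions later. A minimal sketch with pandas, using a toy frame in place of the real dataset (the column names here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Toy stand-in for the churn dataset (column names are illustrative).
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28],
    "contract": ["Month-to-month", "One year", "Month-to-month", "Two year",
                 "Month-to-month", "One year", "Month-to-month", "Two year"],
    "churn": ["Yes", "No", "Yes", "No", "No", "No", "Yes", "No"],
})

# Class balance of the target -- the observation that motivates ADASYN later.
class_ratio = df["churn"].value_counts(normalize=True)
print(class_ratio)

# Missing values per feature, checked before encoding decisions are made.
print(df.isna().sum())
```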
- **Data Preprocessing:**
  - The `preprocess.py` script cleans the dataset, converts categorical features, and handles missing values.
  - The dataset is split into train and test sets and saved into structured directories for subsequent modeling steps.
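The split-and-save step can be sketched as follows. This is an illustrative outline, not the repository's `preprocess.py`: the toy frame, directory layout, and file names are assumptions. A stratified split is used so the minority churn class keeps the same proportion in both partitions.

```python
import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned churn data (schema is illustrative).
df = pd.DataFrame({
    "tenure": range(10),
    "monthly_charges": [20.0, 50.5, 30.0, 80.2, 45.1, 60.0, 25.3, 90.9, 70.7, 55.5],
    "churn": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Stratified split preserves the minority-class ratio in both partitions.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["churn"]
)

# Save to structured directories for downstream stages (paths are assumptions).
os.makedirs("data/processed", exist_ok=True)
train_df.to_csv("data/processed/train.csv", index=False)
test_df.to_csv("data/processed/test.csv", index=False)
```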
- **Model Training & Evaluation:**
  - The `train.py` script trains two models: Logistic Regression and XGBoost with hyperparameter tuning.
  - Feature transformation is applied using `OneHotEncoder` for categorical variables and `StandardScaler` for numerical features.
  - To address class imbalance, ADASYN (Adaptive Synthetic Sampling) is applied to the training dataset.
  - Hyperparameter tuning for XGBoost is performed using `GridSearchCV`.
  - Experiment tracking is managed using MLflow, logging model parameters, performance metrics, and visualizations such as confusion matrices and ROC curves.
  - The trained models and transformers are saved for deployment.
- **Final Model Evaluation on Test Data:**
  - The `evaluate.py` script loads the trained models and transformer to evaluate final performance on the separate test dataset.
  - Key performance metrics include accuracy, F1-score, and AUC-ROC score, which are logged into MLflow for experiment tracking.
  - This stage provides the final validation of model performance before deployment.
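The three headline metrics can be computed with scikit-learn as sketched below; the predictions here are made up for illustration (in `evaluate.py` they would come from the loaded model, and each value would then go to `mlflow.log_metric`). Note that AUC-ROC is computed from predicted probabilities, not hard labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Illustrative labels and predicted probabilities standing in for model output.
y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.8, 0.3, 0.2, 0.9, 0.4, 0.35, 0.05])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

acc = accuracy_score(y_true, y_pred)   # fraction of correct labels
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)    # ranking quality, uses probabilities
print(f"accuracy={acc:.3f} f1={f1:.3f} auc={auc:.3f}")
```

F1 and AUC-ROC are the metrics to watch here: on an imbalanced churn dataset, accuracy alone can look high even for a model that never predicts churn.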
**Tech Stack:**

- Python, pandas, scikit-learn, XGBoost, MLflow
- DVC for dataset version control
- ADASYN for handling imbalanced data
- Matplotlib for visualization (confusion matrix, ROC curves)
Follow these steps to set up and execute the machine learning pipeline:
- **Install Dependencies:**

  ```bash
  pip install -r requirements.txt
  ```
- **Set Up MLflow Credentials:**
  - Create a `.env` file in the project root directory.
  - Refer to `Sample.env` for the required credentials format.
- **Initialize DVC:**

  ```bash
  dvc init
  ```
- **Run the Pipeline:**

  ```bash
  dvc repro
  ```

  This command automatically executes all stages defined in `dvc.yaml` in the correct order.
- **Track Changes (Optional):** If DVC prompts you to track changes after the run, commit the updated lock file:

  ```bash
  git add dvc.lock models/.gitignore
  ```
- **Push Data & Models to Remote Storage (Optional):** If using a remote DVC storage, push your data and models:

  ```bash
  dvc push
  ```
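The stages that `dvc repro` executes are declared in `dvc.yaml` as a dependency graph: each stage lists its command, its inputs (`deps`), and its outputs (`outs`), and DVC re-runs only the stages whose inputs changed. A minimal sketch of what such a file could look like for this pipeline (stage names and paths are assumptions, not the repository's actual `dvc.yaml`):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/processed
    outs:
      - models
  evaluate:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - models
      - data/processed
```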