This project presents a complete machine learning pipeline for handling imbalanced datasets, specifically focusing on customer churn prediction. It utilizes DVC (Data Version Control) for data and model versioning and MLflow for experiment tracking. To address class imbalance, the pipeline integrates ADASYN (Adaptive Synthetic Sampling), improving model performance by generating synthetic samples for the minority class.
The workflow follows a structured approach, including data preprocessing, model training, evaluation, and experiment tracking, ensuring reproducibility and scalability.
- **Data Exploration & EDA:**
  - The `EDA.ipynb` notebook provides a comprehensive exploratory data analysis (EDA) of the dataset.
  - It examines feature distributions, trends, and their influence on churn prediction.
  - At the end of this stage, decisions regarding feature selection, removal, and encoding are documented.
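The starting point of the EDA is usually a look at the class balance and missing values, since those findings drive the resampling and encoding decisions later. A minimal sketch with pandas, using a toy frame in place of the real dataset (the column names here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Toy stand-in for the churn dataset (column names are illustrative).
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28],
    "contract": ["Month-to-month", "One year", "Month-to-month", "Two year",
                 "Month-to-month", "One year", "Month-to-month", "Two year"],
    "churn": ["Yes", "No", "Yes", "No", "No", "No", "Yes", "No"],
})

# Class balance of the target -- the observation that motivates ADASYN later.
class_ratio = df["churn"].value_counts(normalize=True)
print(class_ratio)

# Missing values per feature, checked before encoding decisions are made.
print(df.isna().sum())
```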
- **Data Preprocessing:**
  - The `preprocess.py` script cleans the dataset, converts categorical features, and handles missing values.
  - The dataset is split into train and test sets and saved into structured directories for subsequent modeling steps.
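The split-and-save step can be sketched as follows. This is an illustrative outline, not the repository's `preprocess.py`: the toy frame, directory layout, and file names are assumptions. A stratified split is used so the minority churn class keeps the same proportion in both partitions.

```python
import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned churn data (schema is illustrative).
df = pd.DataFrame({
    "tenure": range(10),
    "monthly_charges": [20.0, 50.5, 30.0, 80.2, 45.1, 60.0, 25.3, 90.9, 70.7, 55.5],
    "churn": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Stratified split preserves the minority-class ratio in both partitions.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["churn"]
)

# Save to structured directories for downstream stages (paths are assumptions).
os.makedirs("data/processed", exist_ok=True)
train_df.to_csv("data/processed/train.csv", index=False)
test_df.to_csv("data/processed/test.csv", index=False)
```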
- **Model Training & Evaluation:**
  - The `train.py` script trains two models: Logistic Regression and XGBoost with hyperparameter tuning.
  - Feature transformation is applied using `OneHotEncoder` for categorical variables and `StandardScaler` for numerical features.
  - To address class imbalance, ADASYN (Adaptive Synthetic Sampling) is applied to the training dataset.
  - Hyperparameter tuning for XGBoost is performed using `GridSearchCV`.
  - Experiment tracking is managed using MLflow, logging model parameters, performance metrics, and visualizations such as confusion matrices and ROC curves.
  - The trained models and transformers are saved for deployment.
- **Final Model Evaluation on Test Data:**
  - The `evaluate.py` script loads the trained models and transformer to evaluate final performance on the separate test dataset.
  - Key performance metrics include accuracy, F1-score, and AUC-ROC score, which are logged into MLflow for experiment tracking.
  - This stage provides the final validation of model performance before deployment.
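The three headline metrics can be computed with scikit-learn as sketched below; the predictions here are made up for illustration (in `evaluate.py` they would come from the loaded model, and each value would then go to `mlflow.log_metric`). Note that AUC-ROC is computed from predicted probabilities, not hard labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Illustrative labels and predicted probabilities standing in for model output.
y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.8, 0.3, 0.2, 0.9, 0.4, 0.35, 0.05])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

acc = accuracy_score(y_true, y_pred)   # fraction of correct labels
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)    # ranking quality, uses probabilities
print(f"accuracy={acc:.3f} f1={f1:.3f} auc={auc:.3f}")
```

F1 and AUC-ROC are the metrics to watch here: on an imbalanced churn dataset, accuracy alone can look high even for a model that never predicts churn.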
**Tech Stack:**

- Python, pandas, scikit-learn, XGBoost, MLflow
- DVC for dataset version control
- ADASYN for handling imbalanced data
- Matplotlib for visualization (confusion matrix, ROC curves)
Follow these steps to set up and execute the machine learning pipeline:
- **Install Dependencies:**

  ```bash
  pip install -r requirements.txt
  ```
- **Set Up MLflow Credentials:**
  - Create a `.env` file in the project root directory.
  - Refer to `Sample.env` for the required credentials format.
- **Initialize DVC:**

  ```bash
  dvc init
  ```
- **Run the Pipeline:**

  ```bash
  dvc repro
  ```

  This command automatically executes all stages defined in `dvc.yaml` in the correct order.
- **Track Changes (Optional):** If DVC prompts you to track changes after the run, commit the updated lock file:

  ```bash
  git add dvc.lock models/.gitignore
  ```
- **Push Data & Models to Remote Storage (Optional):** If using a remote DVC storage, push your data and models:

  ```bash
  dvc push
  ```
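The stages that `dvc repro` executes are declared in `dvc.yaml` as a dependency graph: each stage lists its command, its inputs (`deps`), and its outputs (`outs`), and DVC re-runs only the stages whose inputs changed. A minimal sketch of what such a file could look like for this pipeline (stage names and paths are assumptions, not the repository's actual `dvc.yaml`):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/processed
    outs:
      - models
  evaluate:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - models
      - data/processed
```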