End-to-end machine learning pipeline for customer churn prediction on imbalanced data. Includes ADASYN oversampling, DVC for version control, and MLflow for experiment tracking and model management.

Shah-xai/Customer_churn_MLOPS

Project Summary: Deployment of an End-to-End ML Project on Imbalanced Data

This project presents a complete machine learning pipeline for handling imbalanced datasets, specifically focusing on customer churn prediction. It utilizes DVC (Data Version Control) for data and model versioning and MLflow for experiment tracking. To address class imbalance, the pipeline integrates ADASYN (Adaptive Synthetic Sampling), improving model performance by generating synthetic samples for the minority class.

The workflow follows a structured approach, including data preprocessing, model training, evaluation, and experiment tracking, ensuring reproducibility and scalability.

Project Stages

  1. Data Exploration & EDA:

    • The EDA.ipynb notebook provides a comprehensive exploratory data analysis (EDA) of the dataset.
    • It examines feature distributions, trends, and their influence on churn prediction.
    • At the end of this stage, decisions regarding feature selection, removal, and encoding are documented.
  2. Data Preprocessing:

    • The preprocess.py script cleans the dataset, converts categorical features, and handles missing values.
    • The dataset is split into train and test sets and saved into structured directories for subsequent modeling steps.
  3. Model Training & Evaluation:

    • The train.py script trains two models: Logistic Regression and XGBoost.
    • Feature transformation is applied using OneHotEncoder for categorical variables and StandardScaler for numerical features.
    • To address class imbalance, ADASYN (Adaptive Synthetic Sampling) is applied to the training dataset.
    • Hyperparameter tuning for XGBoost is performed using GridSearchCV.
    • Experiment tracking is managed using MLflow, logging model parameters, performance metrics, and visualizations such as confusion matrices and ROC curves.
    • The trained models and transformers are saved for deployment.
  4. Final Model Evaluation on Test Data:

    • The evaluate.py script loads the trained models and transformer to evaluate final performance on the separate test dataset.
    • Key performance metrics include accuracy, F1-score, and AUC-ROC score, which are logged into MLflow for experiment tracking.
    • This stage provides the final validation of model performance before deployment.

Technologies Used

  • Python, pandas, scikit-learn, XGBoost, MLflow
  • DVC for dataset version control
  • ADASYN for handling imbalanced data
  • Matplotlib for visualization (confusion matrix, ROC curves)

How to Run the Pipeline

Follow these steps to set up and execute the machine learning pipeline:

  1. Install Dependencies:

    pip install -r requirements.txt
  2. Set Up MLflow Credentials:

    • Create a .env file in the project root directory.
    • Refer to Sample.env for the required credentials format.
  3. Initialize DVC:

    dvc init
  4. Run the Pipeline:

    dvc repro

    This command will automatically execute all stages defined in dvc.yaml in the correct order.

  5. Track Changes (Optional): If you see a message prompting you to track changes, run:

    git add dvc.lock models/.gitignore
  6. Push Data & Models to Remote Storage (Optional): If using a remote DVC storage, push your data and models:

    dvc push
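The exact credential keys come from Sample.env; MLflow itself reads standard environment variables for remote tracking, so a typical .env for a hosted tracking server might look like the following sketch (variable names are MLflow's documented ones; the URI and values are placeholders):

```
# .env — illustrative only; copy the exact keys from Sample.env
MLFLOW_TRACKING_URI=https://<your-tracking-server>
MLFLOW_TRACKING_USERNAME=<username>
MLFLOW_TRACKING_PASSWORD=<token>
```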
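For orientation, dvc repro walks the stage graph defined in dvc.yaml, re-running only stages whose dependencies changed. The repository's dvc.yaml is authoritative; the stage names and paths below are an illustrative sketch of how the three scripts might be wired together:

```yaml
# Illustrative dvc.yaml — actual stage names and paths live in the repo
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/processed
    outs:
      - models
  evaluate:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - models
      - data/processed
```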
