TL;DR
Binary classification project to predict customer churn for Interconnect (telecom). The goal is to identify at-risk customers so marketing can run targeted retention offers (discounts, special plans). This repo contains the analysis notebook, modelling experiments and reproducible instructions. Data is not included for privacy reasons; see data/README.md.
Churn prediction helps reduce voluntary customer cancellations and increase lifetime value. We frame the problem as a supervised classification task: given customer attributes, services and contract details, predict whether a customer will churn in the next billing period.
Business goal: maximize precision among the top decile of predicted churn risk (Precision@K) so the retention budget is spent on the customers most likely to leave.
Files provided by Interconnect (example names):
- `contract.csv`: contract length, monthly charges, payment method, contract start.
- `personal.csv`: demographics, tenure, region.
- `internet.csv`: internet service type (DSL/fibre), add-ons (ProteccionDeDispositivo, SeguridadEnLinea).
- `phone.csv`: phone service usage, multiple lines.
Each file has customerID as the unique key. A small anonymized sample is available under data/sample/ for testing.
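The four files can be combined into a single modelling table with a join on `customerID`. A minimal sketch with pandas, using tiny in-memory frames in place of the real CSVs (which are not in the repo); the column names here are illustrative:

```python
import pandas as pd

# Stand-ins for contract.csv and personal.csv; real data is not included.
contract = pd.DataFrame({
    "customerID": ["0001", "0002", "0003"],
    "MonthlyCharges": [29.85, 56.95, 74.40],
})
personal = pd.DataFrame({
    "customerID": ["0001", "0002", "0004"],
    "tenure": [12, 34, 2],
})

# Left-join on the shared key keeps every contract row; customers missing
# from personal.csv get NaN (e.g. customerID 0003 below).
df = contract.merge(personal, on="customerID", how="left")
print(df.shape)  # (3, 3)
```

Using `how="left"` anchors the table on the contract file; `how="outer"` would instead keep customers that appear in only one of the files (e.g. phone-only accounts).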
- Python (Pandas, NumPy)
- Scikit-Learn (pipelines, metrics)
- XGBoost / LightGBM (tree-based models)
- Matplotlib (plots)
- Jupyter Notebook
(See requirements.txt for package list.)
- Data ingestion & join by `customerID`.
- Exploratory Data Analysis (missing values, distributions, correlations).
- Feature engineering: tenure buckets, interaction flags, monthly charge aggregations, service counts.
- Class imbalance handling: class weights and sampling strategies.
- Temporal / stratified splitting to avoid leakage.
- Model training: baseline Logistic Regression → RandomForest / XGBoost → final XGBoost model.
- Evaluation: AUC-ROC, Precision@K (business threshold), confusion matrix, calibration.
- Explainability: feature importance and SHAP (optional).
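The splitting, imbalance-handling and baseline-training steps above can be sketched end to end. This is a minimal illustration on synthetic data (the repo's real features and final XGBoost model are not reproduced here), using a stratified split and class weights as described:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the joined customer table: ~20% positives,
# mimicking the churn-rate imbalance.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

# Stratified split preserves the churn rate in both folds, so minority-class
# metrics are not distorted by an unlucky split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Baseline: scaled logistic regression with class weights to counter imbalance.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"validation AUC-ROC: {auc:.3f}")
```

The same pipeline shape carries over to the tree-based models: swap the estimator, keep the stratified split and the probability-based evaluation.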
- AUC-ROC = 0.911, exceeding the project target (≥ 0.88).
- Accuracy = 0.868 β consistent with a classifier that separates both classes well.
- F1 = 0.757 β computed at the optimal validation threshold.
- ROC & PR curves confirm strong ranking of predicted probabilities; as recall increases, precision decreases progressively (trade-off captured and studied for business thresholds).
Business interpretation: Using the top decile of predicted churners (Precision@10%) enables the marketing team to focus retention incentives where expected uplift is highest, maximising ROI on retention spend.
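Precision@K is not a built-in scikit-learn metric, so a small helper like the following (a sketch, not the repo's exact implementation) can compute it from predicted probabilities:

```python
import numpy as np

def precision_at_k(y_true, y_score, k=0.10):
    """Precision among the top-k fraction of customers ranked by churn score."""
    y_true = np.asarray(y_true)
    n_top = max(1, int(len(y_score) * k))
    # Indices of the n_top highest scores, i.e. the customers to target.
    top_idx = np.argsort(y_score)[::-1][:n_top]
    return float(y_true[top_idx].mean())

# Toy example: the two highest-scored customers are both true churners,
# so precision in the top 20% is 1.0.
y_true = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.1, 0.3, 0.05, 0.4, 0.15, 0.25, 0.35])
print(precision_at_k(y_true, y_score, k=0.20))
```

With `k=0.10` this scores only the top decile, matching the Precision@10% figure used for the business threshold.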
```bash
git clone https://github.com/<YOUR_USER>/<REPO_NAME>.git
cd <REPO_NAME>
python -m venv .venv

# mac / linux
source .venv/bin/activate
# windows
# .venv\Scripts\activate

pip install -r requirements.txt
jupyter notebook notebooks/01_Churn_Prediction.ipynb
```