# FraudDetectAI

## Table of Contents

- Project Overview
- Objectives
- Steps We Have Covered
- Final Model Performance
- Final Learning Curve
- Findings So Far
- Final Steps Before Deployment
- Project Structure
- Model Download
- Acknowledgments
- License
## Project Overview

FraudDetectAI is a machine learning system for detecting fraudulent credit card transactions. Using imbalanced-learning techniques, feature engineering, and gradient-boosted models such as XGBoost, this project aims to build a highly accurate fraud detection system.
## Objectives

- Understand the nature of fraudulent transactions.
- Handle highly imbalanced data effectively.
- Build a robust classification model for fraud detection.
- Use explainability techniques (SHAP) to interpret model predictions.
## Steps We Have Covered

- Dataset overview and missing-value analysis.
- Fraud vs. non-fraud transaction distribution.
- Transaction amount & time distribution analysis.
- PCA feature importance analysis (V1-V28).
- Identified the top 5 PCA features for fraud detection.
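The class-distribution check at the heart of the EDA can be sketched as follows. This is a minimal illustration on synthetic data; the `Class` column name matches the mlg-ulb dataset, but the DataFrame here is a stand-in for `pd.read_csv(...)` on the real file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for the real dataset, e.g. pd.read_csv("src/datasets/creditcard.csv")
df = pd.DataFrame({
    "Amount": rng.exponential(88.0, size=10_000),
    "Class": (rng.random(10_000) < 0.0017).astype(int),  # ~0.17% fraud, as in the real data
})

counts = df["Class"].value_counts()
fraud_rate = counts.get(1, 0) / len(df)
print(f"Non-fraud: {counts.get(0, 0)}, Fraud: {counts.get(1, 0)}")
print(f"Fraud rate: {fraud_rate:.4%}")
```

A rate this low is what makes plain accuracy useless here and motivates the resampling work below.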
- Feature Scaling → Standardized all numerical features.
- Handling Class Imbalance → Applied SMOTE to balance fraud & non-fraud cases.
- Feature Selection → Verified that all features are useful (none were removed).
- Final Dataset Check → Confirmed dataset shape, missing values, and class distribution.
- Final Decision: Kept all features after verifying correlation and importance.
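The scaling step can be sketched like this, on synthetic stand-in features. The key detail (fitting the scaler on the training split only, then reusing its statistics on the test split) is standard practice rather than something stated in this README.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=20.0, size=(1000, 3))  # stand-in numerical features
y = (rng.random(1000) < 0.1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only (avoids leakage)
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0).round(6))  # per-feature means ~0 after scaling
```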
**Note (March 3, 2025):** The SMOTE model was overfitting. We adjusted the SMOTE ratios and combined it with undersampling to improve generalization.
- Trained multiple XGBoost models (Base, Weighted, SMOTE, Hybrid).
- Tuned hyperparameters using Optuna.
- Evaluated models using precision-recall, AUC-ROC, and confusion matrices.
- The SMOTE XGBoost model was overfitting; the Hybrid SMOTE variant fixed this issue.
- Used SHAP (SHapley Additive exPlanations) to analyze model decisions.
- Generated SHAP Summary Plot → Visualizing overall feature impact.
- Created SHAP Decision & Waterfall Plots → Understanding individual fraud predictions.
- V4, V14, and V12 emerged as key fraud indicators.
- Compared Train vs. Test Performance → Checked precision, recall, F1-score, and AUC-ROC.
- Plotted Learning Curves → Visualized training & validation loss trends.
- Confirmed Overfitting in SMOTE XGBoost → Training loss was nearly 0.0, but recall was artificially high.
- Final Decision: Modified SMOTE with Hybrid Sampling (Oversampling + Undersampling).
- Tested multiple SMOTE ratios (e.g., 70:30, 60:40) instead of full 1:1 balancing.
- Applied Hybrid Sampling → Combined undersampling & SMOTE to prevent overfitting.
- Re-trained XGBoost models → Verified that the new dataset improved performance.
- Final Decision: Hybrid SMOTE XGBoost is the best-performing model!
## Final Model Performance

After applying Hybrid SMOTE (Oversampling + Undersampling), the final model achieved the following results:
| Metric | Class 0 (Non-Fraud) | Class 1 (Fraud) | Overall |
|---|---|---|---|
| Precision | 0.98 | 0.99 | - |
| Recall | 1.00 | 0.96 | - |
| F1-Score | 0.99 | 0.97 | - |
| AUC-ROC | - | - | 0.9982 |
- Overfitting has been significantly reduced!
- The model generalizes much better while maintaining high recall!
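The per-class metrics in the table can be reproduced with scikit-learn as follows. The predictions here are simulated stand-ins, not the project's actual model outputs.

```python
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score

rng = np.random.default_rng(4)
y_true = (rng.random(2000) < 0.1).astype(int)
# Simulated probabilities: frauds score near 0.8, non-frauds near 0.0.
scores = np.clip(y_true * 0.8 + rng.normal(scale=0.2, size=2000), 0, 1)
y_pred = (scores >= 0.5).astype(int)

# Per-class precision/recall/F1, as in the table above.
print(classification_report(y_true, y_pred, target_names=["Non-Fraud", "Fraud"], digits=2))
# AUC-ROC uses the raw probabilities, not the thresholded labels.
print(f"AUC-ROC: {roc_auc_score(y_true, scores):.4f}")
```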
## Final Learning Curve

The learning curve for the Hybrid SMOTE XGBoost model shows smooth convergence, indicating that the model is no longer overfitting.
## Findings So Far

EDA Highlights:
- Fraud transactions are extremely rare (0.17%), making imbalance handling crucial.
- Fraud transactions occur more often at night (1 AM - 6 AM).
- Certain PCA features (V17, V14, V12, V10, V16) strongly correlate with fraud.
- Fraud transactions tend to have different distributions in key features.
Preprocessing Summary:
- Applied StandardScaler for feature scaling.
- Tested different SMOTE strategies (Baseline, Weighted, Hybrid).
- Hybrid SMOTE (Oversampling + Undersampling) significantly improved model performance.
- Final dataset check passed → no missing values; the dataset is balanced & ready for training.
Feature Importance Summary:
- SHAP analysis confirmed that V4, V14, and V12 are key fraud indicators.
- Transaction amount also has a moderate influence on fraud detection.
- Decision plots reveal how fraud risk increases with certain feature values.
Overfitting Summary:
- SMOTE XGBoost was overfitting (near-zero training loss, high recall but unrealistic performance).
- Hybrid SMOTE significantly improved generalization while maintaining strong fraud detection capability.
## Final Steps Before Deployment

- Save & document the final model.
- Prepare for model deployment.
- Wrap up the project with a final report.
## Project Structure

```
FraudDetectAI/
├── src/
│   ├── datasets/      # Raw and processed datasets
│   ├── models/        # Trained models
│   ├── notebooks/     # Jupyter notebooks for analysis
│   ├── images/        # Saved visualizations
│   └── reports/       # Project reports & summaries
├── .gitignore         # Ignore unnecessary files
├── README.md          # Project documentation
└── requirements.txt   # Dependencies
```
## Model Download

You can download the trained XGBoost Hybrid model from the GitHub repository at `src/models/xgb_hybrid.pkl`.
## Acknowledgments

This project is based on the Credit Card Fraud Detection dataset published on Kaggle by the Machine Learning Group of ULB (mlg-ulb).
## License

This project is licensed under the MIT License; see the LICENSE file for details.
