FraudDetectAI - A Fraud Detection AI

Python License pandas numpy matplotlib seaborn jupyter scikit-learn xgboost optuna imbalanced-learn shap

Table of Contents

  • Project Overview
  • Objectives
  • Steps We Have Covered
  • Final Model Performance
  • Findings So Far
  • Final Steps Before Deployment
  • Project Structure
  • Model Download
  • Acknowledgments
  • License

Project Overview

FraudDetectAI is a machine learning-based system designed to detect fraudulent credit card transactions.
Using imbalanced learning techniques, feature engineering, and powerful models like XGBoost, this project aims to build a highly accurate fraud detection system.


Objectives

  • Understand the nature of fraudulent transactions.
  • Handle highly imbalanced data effectively.
  • Build a robust classification model for fraud detection.
  • Use explainability techniques (SHAP) to interpret model predictions.

Steps We Have Covered

Exploratory Data Analysis (EDA) ✅ (Completed)

🔹 Dataset overview and missing value analysis.
🔹 Fraud vs. Non-Fraud transaction distribution.
🔹 Transaction amount & time distribution analysis.
🔹 PCA feature importance analysis (V1-V28).
🔹 Identified top 5 PCA features for fraud detection.

Data Preprocessing & Feature Engineering ✅ (Completed)

🔹 Feature Scaling → Standardized all numerical features.
🔹 Handling Class Imbalance → Applied SMOTE to balance fraud & non-fraud cases (see the sketch below).
🔹 Feature Selection → Verified that all features are useful (none were removed).
🔹 Final Dataset Check → Confirmed dataset shape, missing values, and class distribution.
🔹 Final Decision: Kept all features after verifying correlation and importance.
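
The scaling and SMOTE steps above can be reproduced with a minimal sketch along these lines. The file path, column names, split size, and random seeds are assumptions based on the standard Kaggle credit card fraud dataset, not the exact code in this repo's notebooks:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Load the dataset (path and column names assume the standard Kaggle layout).
df = pd.read_csv("src/datasets/creditcard.csv")
X = df.drop(columns=["Class"])
y = df["Class"]

# Split first so SMOTE never sees the test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize all numerical features (fit on train, transform both splits).
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Oversample the minority (fraud) class on the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```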

Note (March 3, 2025):
The SMOTE model was overfitting.
We adjusted SMOTE ratios and combined it with undersampling to improve generalization.

Model Training & Evaluation ✅ (Completed)

🔹 Trained multiple XGBoost models (Base, Weighted, SMOTE, Hybrid).
🔹 Hyperparameter tuning using Optuna (see the tuning sketch below).
🔹 Evaluated models using precision-recall, AUC-ROC, and confusion matrices.
🔹 SMOTE XGBoost was overfitting; Hybrid SMOTE fixed this issue.
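
A hedged sketch of the Optuna tuning loop described above, reusing the resampled split from the preprocessing sketch. The search ranges and the precision-recall objective are illustrative choices, not this repo's exact configuration:

```python
import optuna
import xgboost as xgb
from sklearn.metrics import average_precision_score

def objective(trial):
    # Search space is illustrative; the notebooks' actual ranges may differ.
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
    }
    model = xgb.XGBClassifier(**params, eval_metric="aucpr", random_state=42)
    model.fit(X_res, y_res)                        # resampled training data
    proba = model.predict_proba(X_test)[:, 1]      # untouched test split
    return average_precision_score(y_test, proba)  # area under the PR curve

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```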

Feature Importance Analysis ✅ (Completed)

🔹 Used SHAP (SHapley Additive exPlanations) to analyze model decisions (see the sketch below).
🔹 Generated SHAP Summary Plot → Visualizing overall feature impact.
🔹 Created SHAP Decision & Waterfall Plots → Understanding individual fraud predictions.
🔹 V4, V14, V12 emerged as key fraud indicators.
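
The SHAP plots mentioned above can be generated roughly as follows. This is a sketch assuming a fitted tree model named `model` and the held-out features `X_test` from the earlier sketches; the beeswarm plot is the modern equivalent of the summary plot:

```python
import shap

# Tree-model explainer for the trained XGBoost classifier.
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_test)

# Global view: which features push predictions toward fraud overall.
shap.plots.beeswarm(shap_values)

# Local view: why one individual transaction was scored the way it was.
shap.plots.waterfall(shap_values[0])
```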

Overfitting Analysis ✅ (Completed)

🔹 Compared Train vs. Test Performance → Checked precision, recall, F1-score, and AUC-ROC.
🔹 Plotted Learning Curves → Visualized training & validation loss trends (see the sketch below).
🔹 Confirmed Overfitting in SMOTE XGBoost → Training loss was nearly 0.0 and recall was artificially high.
🔹 Final Decision: Modified SMOTE with Hybrid Sampling (Oversampling + Undersampling).
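
One illustrative way to reproduce the train-vs-validation curves discussed above (the repo's notebooks may plot them differently; variable names follow the earlier sketches):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Track log loss on both the (resampled) training data and the untouched test set.
model = xgb.XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42)
model.fit(X_res, y_res, eval_set=[(X_res, y_res), (X_test, y_test)], verbose=False)

history = model.evals_result()
plt.plot(history["validation_0"]["logloss"], label="train")
plt.plot(history["validation_1"]["logloss"], label="validation")
plt.xlabel("Boosting round")
plt.ylabel("Log loss")
plt.legend()
plt.show()
```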

SMOTE Adjustment ✅ (Completed)

🔹 Tested multiple SMOTE ratios (e.g., 70:30, 60:40) instead of full 1:1 balancing.
🔹 Applied Hybrid Sampling → Combined undersampling & SMOTE to prevent overfitting (see the sketch below).
🔹 Re-trained XGBoost models → Verified that the new dataset improved performance.
🔹 Final Decision: Hybrid SMOTE XGBoost is the best-performing model!
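
A minimal sketch of the hybrid sampling step with imbalanced-learn, reusing the scaled training split from the preprocessing sketch. The ratios below are illustrative of the 60:40-style targets mentioned above, not the exact values chosen in the notebooks:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# First oversample fraud up to 10% of the majority class with SMOTE...
X_over, y_over = SMOTE(
    sampling_strategy=0.1, random_state=42
).fit_resample(X_train, y_train)

# ...then undersample non-fraud so the final class ratio is roughly 60:40.
X_hybrid, y_hybrid = RandomUnderSampler(
    sampling_strategy=0.67, random_state=42
).fit_resample(X_over, y_over)
```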


Final Model Performance

After applying Hybrid SMOTE (Oversampling + Undersampling), the final model achieved the following results:

Hybrid SMOTE XGBoost Model Performance:

Metric      Class 0 (Non-Fraud)   Class 1 (Fraud)   Overall
Precision   0.98                  0.99              -
Recall      1.00                  0.96              -
F1-Score    0.99                  0.97              -
AUC-ROC     -                     -                 0.9982
  • Overfitting has been significantly reduced!
  • The model generalizes much better while maintaining high recall!
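
Per-class metrics and AUC-ROC like those in the table are typically computed with scikit-learn along these lines (a sketch; `model`, `X_test`, and `y_test` are assumed from the earlier sketches):

```python
from sklearn.metrics import classification_report, roc_auc_score

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, digits=2))  # per-class precision, recall, F1
print("AUC-ROC:", roc_auc_score(y_test, y_proba))
```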

Final Learning Curve

The learning curve for the Hybrid SMOTE XGBoost model shows smooth convergence, meaning the model is no longer overfitting:

Learning Curve - Hybrid SMOTE


Findings So Far

EDA Highlights:

  • Fraud transactions are extremely rare (0.17%), making imbalance handling crucial.
  • Fraud transactions occur more often at night (1 AM - 6 AM).
  • Certain PCA features (V17, V14, V12, V10, V16) strongly correlate with fraud.
  • Fraud transactions tend to have different distributions in key features.

Preprocessing Summary:

  • Applied StandardScaler for feature scaling.
  • Tested different SMOTE strategies (Baseline, Weighted, Hybrid).
  • Hybrid SMOTE (Oversampling + Undersampling) significantly improved model performance.
  • Final dataset check passed → No missing values, dataset is balanced & ready for training.

Feature Importance Summary:

  • SHAP analysis confirmed that V4, V14, and V12 are key fraud indicators.
  • Transaction amount also has a moderate influence on fraud detection.
  • Decision plots reveal how fraud risk increases with certain feature values.

Overfitting Summary:

  • SMOTE XGBoost was overfitting (near-zero training loss and unrealistically high recall).
  • Hybrid SMOTE significantly improved generalization while maintaining strong fraud detection capability.

Final Steps Before Deployment

  1. Save & document the final model.
  2. Prepare for model deployment.
  3. Wrap up the project with a final report.

Project Structure

FraudDetectAI/
│── src/
│   ├── datasets/           # Raw and processed datasets
│   ├── models/             # Trained models
│   ├── notebooks/          # Jupyter notebooks for analysis
│   ├── images/             # Saved visualizations
│   ├── reports/            # Project reports & summaries
│── .gitignore              # Ignore unnecessary files
│── README.md               # Project documentation
│── requirements.txt        # Dependencies

Model Download

You can download the trained XGBoost Hybrid Model from:

GitHub Repository: src/models/xgb_hybrid.pkl
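
A loading sketch for the pickled model above. Whether it was serialized with pickle or joblib is not stated here, and the 30-feature dummy input (Time, V1-V28, Amount, scaled) is a placeholder assumption:

```python
import pickle
import numpy as np

with open("src/models/xgb_hybrid.pkl", "rb") as f:
    model = pickle.load(f)

# Placeholder: one transaction with 30 scaled features in the training column order.
sample = np.zeros((1, 30))
print("Fraud probability:", model.predict_proba(sample)[:, 1][0])
```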

Acknowledgments

This project is based on the Credit Card Fraud Detection dataset from the ULB Machine Learning Group (mlg-ulb on Kaggle).

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

FraudDetectAI is an advanced credit card fraud detection system built with XGBoost and Hybrid SMOTE Sampling (Oversampling + Undersampling). This project tackles highly imbalanced datasets, ensuring strong fraud detection accuracy while minimizing overfitting risks.
