
🔗 When Graph Embeddings Don't Help

Graph-Tabular Fusion on Elliptic++ Bitcoin Fraud Detection

License: MIT Python 3.8+ PyTorch XGBoost Code style: black


🎯 TL;DR

Graph embeddings are supposed to enhance tabular models. But on Elliptic++ Bitcoin fraud detection, adding Node2Vec embeddings to XGBoost actually decreases performance by 2%.

This repository demonstrates why and validates that rich tabular features already encode graph structure, making explicit graph embeddings redundant.


💡 The Key Finding

Main Result: XGBoost (tabular-only) achieves PR-AUC 0.669, while XGBoost + Node2Vec (fusion) achieves only 0.656.

Why? Features AF1–AF93 (local transaction attributes) combined with the baseline's AF94–AF182 (neighbor aggregates) already capture graph topology.

Conclusion: Graph embeddings don't add value when tabular features already encode neighborhood information, a negative result that's scientifically valuable.

Figure 1: Direct comparison showing fusion underperforming the baseline tabular-only XGBoost.


📊 Performance Comparison

We trained a fusion model using strict temporal splits (no leakage) on the Elliptic++ dataset:

| Model | Features | PR-AUC ⭐ | ROC-AUC | F1 | Recall@1% |
|-------|----------|-----------|---------|----|-----------|
| XGBoost (Baseline) | Tabular only (AF1-93) | 0.669 🏆 | 0.888 | 0.699 | - |
| XGBoost + Node2Vec | Tabular + 64-dim embeddings | 0.656 ⚠️ | 0.861 | 0.688 | 17.5% |
| Random Forest | Tabular only | 0.658 | 0.877 | 0.694 | - |
| MLP | Tabular only | 0.364 | 0.830 | 0.486 | - |

Figure 2: Multi-metric comparison across models. Fusion (blue) consistently underperforms baseline XGBoost (green).

⚠️ Key Insight: The 2% performance drop (0.669 → 0.656) when adding graph embeddings indicates that tabular features already capture neighborhood information effectively.


πŸ—οΈ Architecture & Pipeline

Figure 3: Graph-Tabular Fusion pipeline showing leakage-free embedding generation and feature concatenation.

Fusion Protocol A (Implemented):

  1. Temporal splits: 60% train / 20% val / 20% test (from baseline)
  2. Graph embeddings: Node2Vec (64-dim) generated per-split to prevent leakage
  3. Tabular features: Local features (AF1-93) to avoid double-encoding
  4. Fusion: Concatenate embeddings + features → 157 total dimensions
  5. Model: XGBoost with early stopping on validation PR-AUC (see the sketch below)
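
Below is a minimal sketch of steps 4-5 (feature concatenation plus XGBoost with early stopping on validation PR-AUC). File paths, column names, and the splits layout are illustrative assumptions; the actual loaders live under src/ and may differ.

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import average_precision_score

# Hypothetical file layouts for illustration only; the repo's loaders under src/data/ may differ.
feats = pd.read_csv("data/local_features.csv", index_col="txId")   # AF1-93 local features
emb = pd.read_parquet("data/embeddings.parquet")                   # 64-dim Node2Vec vectors
splits = pd.read_csv("data/splits.csv", index_col="txId")          # assumed columns: 'split', 'label'

def fuse(ids):
    # Step 4: concatenate tabular features with embeddings -> 93 + 64 = 157 dimensions.
    return feats.loc[ids].join(emb, how="left").fillna(0.0)

train_ids = splits.index[splits["split"] == "train"]
val_ids = splits.index[splits["split"] == "val"]
X_train, y_train = fuse(train_ids), splits.loc[train_ids, "label"]
X_val, y_val = fuse(val_ids), splits.loc[val_ids, "label"]

# Step 5: XGBoost with early stopping monitored on validation PR-AUC ("aucpr").
clf = xgb.XGBClassifier(
    n_estimators=1000,
    eval_metric="aucpr",
    early_stopping_rounds=50,  # constructor arg in XGBoost >= 1.6; pass to fit() on older versions
    random_state=42,
)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("val PR-AUC:", average_precision_score(y_val, clf.predict_proba(X_val)[:, 1]))
```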

Leakage Prevention:

  • ✅ Embeddings computed separately for train/val/test using within-split edges only (see the sketch after this list)
  • ✅ No future information used in random walks
  • ✅ Same temporal splits as baseline for fair comparison
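
The within-split embedding step can be sketched as follows, assuming edge-list columns txId1/txId2 and the node2vec PyPI package (NetworkX + Gensim); the actual implementation lives in src/embeddings/ and walk hyperparameters may differ.

```python
import networkx as nx
import pandas as pd
from node2vec import Node2Vec  # PyPI 'node2vec' package, built on NetworkX and Gensim

def embed_split(edges: pd.DataFrame, split_ids, dim: int = 64, seed: int = 42) -> pd.DataFrame:
    """Fit Node2Vec on the subgraph induced by a single temporal split,
    so random walks never cross into other splits."""
    ids = set(split_ids)
    within = edges["txId1"].isin(ids) & edges["txId2"].isin(ids)  # within-split edges only
    g = nx.from_pandas_edgelist(edges[within], "txId1", "txId2")
    g.add_nodes_from(ids)                                         # keep isolated nodes as well
    n2v = Node2Vec(g, dimensions=dim, walk_length=30, num_walks=10, workers=4)  # illustrative walk params
    model = n2v.fit(window=10, min_count=1, seed=seed)            # kwargs are passed to gensim Word2Vec
    vectors = {n: model.wv[str(n)] for n in ids if str(n) in model.wv}
    return pd.DataFrame.from_dict(vectors, orient="index")

# One call per split; the resulting frames are then written to data/embeddings.parquet.
```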

🔍 Why Fusion Didn't Help

Figure 4: Feature contribution analysis showing that embeddings do not improve over tabular features.

Three reasons embeddings are redundant:

  1. Tabular features already encode graph structure

    • Local features (AF1-93) capture transaction characteristics
    • Baseline aggregate features (AF94-182) explicitly encode neighbor statistics
    • Graph topology is implicitly represented in the data
  2. Node2Vec embeddings approximate what features already have

    • Random walk embeddings learn neighborhood structure
    • Similar patterns to pre-computed neighbor aggregates
    • No unique signal beyond tabular representation
  3. Rich feature engineering beats architectural complexity

    • 166 engineered features per node (local + aggregates)
    • Significant domain knowledge encoded in features
    • Graph structure less informative than node attributes

📈 Results Summary

Figure 5: Comprehensive results summary comparing fusion vs baseline.

Key Metrics (Test Set):

  • PR-AUC: 0.656 vs 0.669 baseline (-2%)
  • ROC-AUC: 0.861 vs 0.888 baseline (-3%)
  • F1 Score: 0.688 vs 0.699 baseline (-2%)
  • Training Time: Similar (~2 minutes on CPU)
  • Features: 157 vs 93 (the 64 added embedding dimensions contributed no value)

Validation Performance:

  • PR-AUC: 0.965 (excellent learning)
  • Slight overfitting from validation to test

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • 2GB disk space for dataset + embeddings
  • Optional: GPU for faster embedding generation (CPU works, ~30 min)

Installation & Reproduction

# 1️⃣ Clone repository
git clone https://github.com/BhaveshBytess/GraphTabular-FraudFusion.git
cd GraphTabular-FraudFusion

# 2️⃣ Setup environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# 3️⃣ Download Elliptic++ dataset (NOT included)
# Get from: https://drive.google.com/drive/folders/1MRPXz79Lu_JGLlJ21MDfML44dKN9R08l
# Place files in: data/Elliptic++ Dataset/
#   ├── txs_features.csv
#   ├── txs_classes.csv
#   └── txs_edgelist.csv

# 4️⃣ Verify dataset
python src/data/verify_dataset.py "data/Elliptic++ Dataset"

# 5️⃣ Generate embeddings (~30 min on CPU, ~5 min on GPU)
python scripts/generate_embeddings.py

# 6️⃣ Train fusion model (~2 min)
python scripts/train_fusion.py

# 7️⃣ View results
ls reports/  # Metrics and model
ls reports/plots/  # Visualizations

Expected Output:

  • Embeddings: data/embeddings.parquet (70 MB, 203K nodes × 64 dims)
  • Model: reports/xgb_fusion.json
  • Metrics: reports/metrics.json (PR-AUC ≈ 0.656 ± 0.01)
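
A quick sanity check of these artifacts (a minimal sketch; the exact JSON keys in metrics.json may differ):

```python
import json
import pandas as pd

# Embeddings: expect roughly 203K rows (one per transaction) and 64 columns.
emb = pd.read_parquet("data/embeddings.parquet")
print(emb.shape)

# Metrics: test PR-AUC for the fusion model should land near 0.656.
with open("reports/metrics.json") as f:
    print(json.dumps(json.load(f), indent=2))
```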

πŸ“ Repository Structure

graph-tabular-fusion/
├── data/
│   ├── Elliptic++ Dataset/        # User-provided (see Quick Start)
│   └── embeddings.parquet         # Generated Node2Vec embeddings
├── notebooks/
│   ├── 01_generate_embeddings.ipynb    # Kaggle-ready
│   ├── 02_fusion_xgb.ipynb             # Kaggle-ready
│   └── 03_ablation_studies.ipynb       # Optional experiments
├── src/
│   ├── data/                      # Loaders, splits, verification
│   ├── embeddings/                # Node2Vec implementation
│   ├── train/                     # XGBoost fusion trainer
│   ├── eval/                      # Comparison reports
│   └── utils/                     # Metrics, seeding, logging
├── configs/                       # YAML configurations
├── reports/
│   ├── metrics.json               # Evaluation results
│   ├── metrics_summary.csv        # Consolidated comparison
│   ├── plots/                     # Visualizations
│   └── xgb_fusion.json            # Trained model
├── scripts/                       # Execution pipelines
└── docs/                          # Specifications, provenance

🔬 Experimental Rigor

Reproducibility ✅

  • Seed: 42 (fixed for all random operations)
  • Splits: Temporal 60/20/20 (imported from baseline)
  • Embeddings: Deterministic Node2Vec (fixed seed)
  • Metrics: Same evaluation protocol as baseline
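
For reference, a minimal seeding helper in the spirit of the repo's utilities (the actual function under src/utils/ may differ):

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness used across the pipeline."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)  # relevant when embeddings are generated on GPU
    except ImportError:
        pass                     # torch is optional for the tabular-only path
```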

Leakage Prevention ✅

  • Per-split embedding generation: Train/val/test embeddings computed independently
  • Within-split edges only: No cross-split information in random walks
  • Temporal isolation: No future information leaks to past

Fair Comparison ✅

  • Same splits as baseline (exact txId alignment)
  • Same metrics (PR-AUC, ROC-AUC, F1, Recall@K)
  • Same class weighting (computed from training data)
  • No hyperparameter tuning (baseline config reused)
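
Recall@K is the least standard of these metrics; a minimal reference implementation, assuming higher scores mean more likely fraud:

```python
import numpy as np

def recall_at_k(y_true, y_score, k_frac: float = 0.01) -> float:
    """Share of all true frauds captured in the top k_frac fraction of
    transactions ranked by predicted fraud score (k_frac=0.01 -> Recall@1%)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    k = max(1, int(np.ceil(k_frac * len(y_score))))
    top_k = np.argsort(-y_score)[:k]   # indices of the k highest-scored transactions
    return float(y_true[top_k].sum() / max(1, y_true.sum()))
```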

💡 Key Takeaways

For Practitioners 🏭

  1. Use tabular features alone - simpler, faster, equally effective
  2. Graph embeddings ≠ automatic improvement - validate with strong baselines
  3. Feature engineering > model complexity for fraud detection
  4. XGBoost on rich features often beats sophisticated graph methods

For Researchers 🎓

  1. Negative results are valuable - demonstrate when fusion doesn't help
  2. Baseline comparison is critical - always test against best tabular methods
  3. Feature redundancy matters - check what's already in your data
  4. Honest reporting builds credibility - report findings, not hopes

For ML Engineers 👨‍💻

  1. Production-ready: Simpler XGBoost preferred (no embeddings needed)
  2. Deployment: Tabular-only approach easier to maintain and debug
  3. Cost: Save computation (no embedding generation required)
  4. Interpretability: XGBoost feature importance more actionable

🎓 Scientific Contribution

This work contributes:

  1. Empirical validation of when graph methods don't help
  2. Rigorous methodology for fusion model evaluation
  3. Honest reporting of negative results (often unpublished)
  4. Reproducible pipeline for graph-tabular fusion experiments
  5. Portfolio demonstration of scientific thinking and rigor

Publication-worthy aspects:

  • Leakage-free temporal evaluation framework
  • Comprehensive baseline comparison
  • Clear interpretation of negative results
  • Reproducible experimental design
  • Practical guidance for practitioners

📚 Related Work & Baseline

This extension builds on the baseline project:

Baseline Repository: FRAUD-DETECTION-GNN

Baseline Finding:

"XGBoost (tabular) beats GraphSAGE (GNN) by 49% because features already encode neighbor information."

This Extension Validates:

"Adding explicit graph embeddings doesn't help because tabular features already capture graph structure."

Provenance: See docs/baseline_provenance.json for baseline commit SHA and imported artifacts.


🛣️ Future Work (Optional Extensions)

Not implemented but interesting:

  • Protocol B: Test with full features (AF1-182) + embeddings
  • Embedding dimensions: Sweep 16/32/128 (does size matter?)
  • GraphSAGE export: Compare supervised vs unsupervised embeddings
  • MLP fusion learner: Alternative to XGBoost
  • Explainability: SHAP analysis on fusion features
  • Temporal embeddings: Time-aware graph learning
  • Cross-dataset: Test on Ethereum phishing networks

📖 Citation

If you use this work, please cite:

@software{graphtabular_fusion_2025,
  title={Graph-Tabular Fusion on Elliptic++ Bitcoin Fraud Detection},
  author={Your Name},
  year={2025},
  url={https://github.com/BhaveshBytess/GraphTabular-FraudFusion}
}

Dataset Citation:

@article{weber2019anti,
  title={Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics},
  author={Weber, Mark and Domeniconi, Giacomo and Chen, Jie and Weidele, Daniel Karl I and Bellei, Claudio and Robinson, Tom and Leiserson, Charles E},
  journal={arXiv preprint arXiv:1908.02591},
  year={2019}
}

📄 License

MIT License - See LICENSE for details.

Educational/demonstrative use. Respect Elliptic++ dataset terms and conditions.


🙏 Acknowledgments

  • Elliptic for the Elliptic++ dataset
  • Baseline project for splits, metrics, and utilities
  • PyTorch Geometric & XGBoost communities
  • NetworkX & Gensim for Node2Vec implementation

📧 Contact

For questions, issues, or collaboration:

  • GitHub Issues: Open an issue
  • Email: [Your email]
  • LinkedIn: [Your profile]

⭐ Star this repo if you find it useful for understanding when graph methods don't help!

Status: ✅ Complete (E1-E3) | 📊 Results validated | 🎓 Portfolio-ready


Last Updated: November 2025 | Version: 1.0.0