
Fraud Detection - Naive Approach (Without Feature Store)

Overview

This is the naive implementation of a fraud detection system WITHOUT using a feature store like Featureform. It demonstrates the problems that arise when building ML systems without proper feature infrastructure.

What's Included

naive/
├── docker-compose.yml       # Simple Postgres setup
├── load_data.py            # Load data with train/test split
├── train_naive.py          # Train model with inline features
├── inference_naive.py      # Run predictions (SLOW!)
└── README.md              # This file

Quick Start

Set up Kaggle (macOS)

Get a Kaggle API key from the Kaggle website (it downloads as kaggle.json), then move it into place and set up a Python environment:

mkdir -p ~/.kaggle/
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
# Create virtual environment
python3.9 -m venv venv

# Activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

1. Start Postgres

cd naive/
docker-compose up -d

2. Download Dataset

You need the IEEE-CIS Fraud Detection dataset from Kaggle:

# Set up Kaggle credentials first (see above)
pip install kaggle
python download_dataset.py

3. Load Data

# Quick mode: 50K rows
python load_data.py --quick

# Full dataset: 590K rows
python load_data.py

4. Train Model

python train_naive.py

5. Run Inference

# Single prediction (random test transaction)
python inference_naive.py --random

# Specific transaction
python inference_naive.py --transaction-id 2987000

# Batch predictions (watch it get slow!)
python inference_naive.py --batch 20

The Problems (Why You Need Featureform)

Problem 1: Duplicated Feature Logic

In Training (train_naive.py):

def compute_features(df):
    # Card-level aggregates computed over the full training DataFrame
    card_stats = df.groupby('card1').agg(
        card_transaction_count=('transaction_amt', 'count'),
        card_avg_amt=('transaction_amt', 'mean'),
        card_fraud_rate=('is_fraud', 'mean'),
    ).reset_index()
    df = df.merge(card_stats, on='card1')
    ...

In Inference (inference_naive.py):

def compute_features_for_inference(transaction_id, conn):
    # Must duplicate the SAME logic
    card_query = """
        SELECT COUNT(*) as card_transaction_count,
               AVG(transaction_amt) as card_avg_amt,
               AVG(is_fraud) as card_fraud_rate
        FROM transactions WHERE card1 = ...
    """
    ...

Result:

  • ✗ Easy to introduce bugs (logic diverges over time)
  • ✗ Training-serving skew (different implementations)
  • ✗ No version control
  • ✗ Hard to maintain
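
To see how quickly the two code paths drift apart, you can spot-check them against each other. This is only a sketch: it assumes compute_features and compute_features_for_inference can be imported from the two scripts above, that the training DataFrame has a transaction_id column, and that the inference helper returns a dict keyed by feature name (all assumptions, not guaranteed by this repo):

from train_naive import compute_features                      # training-side logic
from inference_naive import compute_features_for_inference    # serving-side logic

def check_training_serving_skew(df, conn, sample_ids):
    """Compare offline and online feature values for a few transactions."""
    offline = compute_features(df).set_index('transaction_id')  # assumed column name
    for tid in sample_ids:
        online = compute_features_for_inference(tid, conn)      # assumed to return a dict
        for col in ('card_avg_amt', 'card_fraud_rate'):
            if abs(offline.loc[tid, col] - online[col]) > 1e-6:
                print(f"SKEW on {tid}.{col}: offline={offline.loc[tid, col]:.4f} "
                      f"vs online={online[col]:.4f}")

In practice even this check is fragile, because both implementations have to stay importable and in sync, which is exactly the maintenance burden a feature store removes.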

Problem 2: Slow Inference

Every prediction requires:

  1. Query all transactions for the card → SLOW
  2. Query all transactions for the email → SLOW
  3. Compute aggregates from scratch → SLOW
  4. No caching whatsoever → SLOW
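
Where that time goes is easy to see with a coarse timer around the two phases. A minimal sketch, assuming compute_features_for_inference (from the snippet above) returns a dict of feature values and that the trained model is a scikit-learn-style estimator with predict_proba (both assumptions):

import time

from inference_naive import compute_features_for_inference  # serving-side logic above

def timed_prediction(transaction_id, conn, model):
    """Predict one transaction and report where the latency goes."""
    t0 = time.perf_counter()
    features = compute_features_for_inference(transaction_id, conn)  # aggregate queries hit Postgres
    t1 = time.perf_counter()
    fraud_prob = model.predict_proba([list(features.values())])[0, 1]  # assumed feature ordering
    t2 = time.perf_counter()

    feat_ms, model_ms = (t1 - t0) * 1000, (t2 - t1) * 1000
    total_ms = feat_ms + model_ms
    print(f"Feature computation: {feat_ms:7.2f}ms ({feat_ms / total_ms:5.1%})")
    print(f"Model inference:     {model_ms:7.2f}ms ({model_ms / total_ms:5.1%})")
    print(f"Total:               {total_ms:7.2f}ms")
    return fraud_prob

# Usage (hypothetical wiring):
#   model = joblib.load('model.joblib')   # assuming the model was persisted with joblib
#   timed_prediction(2987000, conn, model)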

Example output:

Feature computation: 847.23ms (97.8%)
Model inference:      18.45ms ( 2.2%)
Total:               865.68ms

⚠️  97.8% of time spent recomputing features!

Scaling problems:

20 predictions = 17 seconds
→ 1,000 predictions = 14 minutes
→ 100,000 predictions = 24 hours

This is NOT suitable for production!

Problem 3: No Point-in-Time Correctness

In the naive approach, you could accidentally use future data:

# WRONG: This includes transactions AFTER the one we're predicting!
card_stats = df.groupby('card1')['is_fraud'].mean()

This causes data leakage and inflates model performance artificially.
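
For contrast, here is what a point-in-time-correct version of the same feature looks like: each row only sees transactions that happened strictly before it. A minimal sketch, assuming the DataFrame carries a transaction_dt column to order by (an assumption about this repo's column naming):

import pandas as pd

def card_fraud_rate_point_in_time(df: pd.DataFrame) -> pd.Series:
    """Historical fraud rate per card, using only PAST transactions for each row."""
    df = df.sort_values('transaction_dt')  # assumed time-ordering column
    past_rate = (
        df.groupby('card1')['is_fraud']
          # shift(1) drops the current row, so its own label (and anything later)
          # never leaks into its feature value
          .transform(lambda s: s.shift(1).expanding().mean())
    )
    return past_rate.fillna(0.0)  # cards with no history get a neutral default

Getting this right by hand for every feature, in both training and inference code, is tedious and error-prone.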

Problem 4: No Feature Versioning

When you update features:

  • Old models break
  • No rollback capability
  • Hard to A/B test features
  • Can't compare model versions fairly

Problem 5: Database Gets Hammered

Every inference runs expensive aggregate queries:

  • Database CPU spikes
  • Other queries get slow
  • Not scalable
  • Production DB at risk

Performance Comparison

Naive Approach (This Implementation)

Metric                    | Value
--------------------------|-----------------------
Single prediction latency | 500-1000ms
Throughput                | 1-2 predictions/sec
Database load             | Very high
Feature computation       | 95%+ of time
Scalability               | Poor

With Featureform (Feature Store)

Metric                    | Value
--------------------------|-----------------------
Single prediction latency | 10-50ms
Throughput                | 20-100 predictions/sec
Database load             | Low (pre-computed)
Feature computation       | <10% of time
Scalability               | Excellent

Result: 10-100x faster with Featureform!

What Featureform Solves

1. Pre-Computed Features (Fast!)

# Features are materialized once
# Served from Redis (milliseconds)
features = client.features(...).serve(entity_id)

2. Single Source of Truth

# Define feature ONCE
@ff.entity
class Transaction:
    card_avg_amt = ff.Feature(
        transformation[["entity_id", "card_avg_amt"]],
        variant="v1"
    )

3. Automatic Versioning

# Training uses v1
training_set = client.training_set("fraud_detection", "v1")

# Deploy v2 without breaking v1
@ff.feature(variant="v2")
def card_avg_amt_v2():
    ...

4. Point-in-Time Correctness

# Featureform automatically handles temporal correctness
# No data leakage possible

5. Monitoring & Lineage

  • Track feature usage across models
  • Monitor feature drift
  • Understand dependencies
  • Debug issues quickly

Try Both Approaches

Naive (This Implementation)

cd naive/
python inference_naive.py --batch 20
# Watch: ~17 seconds for 20 predictions

With Featureform

cd ../  # Go to featureform implementation
python inference.py --batch 20
# Watch: ~1 second for 20 predictions
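
If you want a single like-for-like number, a small timing wrapper around the two CLIs works. A minimal sketch, assuming it is run from the parent directory and that both commands behave as shown above:

import subprocess
import time

def time_cli(cmd, cwd):
    """Run a prediction CLI and report wall-clock time for the whole batch."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True)
    elapsed = time.perf_counter() - start
    print(f"{' '.join(cmd)} ({cwd}): {elapsed:.1f}s")
    return elapsed

naive = time_cli(['python', 'inference_naive.py', '--batch', '20'], cwd='naive')
featureform = time_cli(['python', 'inference.py', '--batch', '20'], cwd='.')
print(f"Speedup: {naive / featureform:.1f}x")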

Real-World Impact

Scenario: E-commerce Fraud Detection

Without Featureform:

  • 800ms per prediction
  • Can handle ~1,200 transactions/hour per server
  • Need 8 servers for 10,000 transactions/hour
  • Cost: ~$3,200/month (8 × $400)
  • Risk: Database overload during peak times

With Featureform:

  • 20ms per prediction
  • Can handle ~180,000 transactions/hour per server
  • Need 1 server for 10,000 transactions/hour
  • Cost: ~$400/month
  • Risk: Minimal, scales easily

Savings: ~$2,800/month, plus far fewer operational headaches.

The Bottom Line

Without a Feature Store (Naive)

  • ✗ Slow predictions (100ms - 1000ms)
  • ✗ Duplicated feature logic
  • ✗ Risk of training-serving skew
  • ✗ Database becomes bottleneck
  • ✗ Hard to maintain
  • ✗ Not scalable
  • ✗ No versioning
  • ✗ Manual point-in-time handling

With Featureform

  • ✓ Fast predictions (10-50ms)
  • ✓ Single feature definition
  • ✓ Guaranteed consistency
  • ✓ Database barely touched
  • ✓ Easy to maintain
  • ✓ Horizontally scalable
  • ✓ Automatic versioning
  • ✓ Point-in-time correctness built-in

Next Steps

  1. Run the naive approach to see the problems firsthand
  2. Compare with the Featureform implementation in the parent directory
  3. Measure the difference in latency and throughput
  4. Calculate the ROI for your use case

Questions?

This naive implementation intentionally shows bad practices to demonstrate why feature stores like Featureform are essential for production ML systems.

For the proper Featureform implementation, see the parent directory.
