This is the naive implementation of a fraud detection system WITHOUT using a feature store like Featureform. It demonstrates the problems that arise when building ML systems without proper feature infrastructure.
```
naive/
├── docker-compose.yml   # Simple Postgres setup
├── load_data.py         # Load data with train/test split
├── train_naive.py       # Train model with inline features
├── inference_naive.py   # Run predictions (SLOW!)
└── README.md            # This file
```
Get a Kaggle API key from the Kaggle website, then install it locally:

```bash
mkdir -p ~/.kaggle/
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```
```bash
# Create virtual environment
python3.9 -m venv venv

# Activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

```bash
cd naive/
docker-compose up -d
```

You need the IEEE-CIS Fraud Detection dataset from Kaggle:
```bash
# Setup Kaggle credentials first
pip install kaggle
python download_dataset.py
```

```bash
# Quick mode: 50K rows
python load_data.py --quick

# Full dataset: 590K rows
python load_data.py
```

```bash
python train_naive.py
```

```bash
# Single prediction (random test transaction)
python inference_naive.py --random

# Specific transaction
python inference_naive.py --transaction-id 2987000

# Batch predictions (watch it get slow!)
python inference_naive.py --batch 20
```

In Training (train_naive.py):
```python
def compute_features(df):
    # Card aggregates: transaction count, average amount, and fraud rate per card
    card_stats = df.groupby('card1').agg({
        'transaction_amt': ['count', 'mean'],
        'is_fraud': 'mean'
    })
    # Flatten the MultiIndex columns so the merge works cleanly
    card_stats.columns = ['card_transaction_count', 'card_avg_amt', 'card_fraud_rate']
    df = df.merge(card_stats, on='card1')
    ...
```

In Inference (inference_naive.py):
```python
def compute_features_for_inference(transaction_id, conn):
    # Must duplicate the SAME logic, this time in SQL
    card_query = """
        SELECT COUNT(*)             AS card_transaction_count,
               AVG(transaction_amt) AS card_avg_amt,
               AVG(is_fraud)        AS card_fraud_rate
        FROM transactions
        WHERE card1 = %s
    """
    ...
```

Result:
- ✗ Easy to introduce bugs (logic diverges over time)
- ✗ Training-serving skew (different implementations; see the sketch below)
- ✗ No version control
- ✗ Hard to maintain
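As a concrete example of how the two implementations silently diverge: pandas' per-column `count` skips missing values, while SQL's `COUNT(*)` counts every row. A minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical transactions for one card, with one missing amount
df = pd.DataFrame({
    'card1': [1001, 1001, 1001],
    'transaction_amt': [50.0, np.nan, 20.0],
})

# Training path (pandas): count() skips the NaN row -> 2
pandas_count = df.groupby('card1')['transaction_amt'].count().iloc[0]

# Serving path (SQL): SELECT COUNT(*) counts every row -> 3
sql_count = len(df)  # what COUNT(*) would return for this card

print(pandas_count, sql_count)  # 2 3 -- same "feature", different values
```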
Every prediction requires:
- Query all transactions for the card → SLOW
- Query all transactions for the email → SLOW
- Compute aggregates from scratch → SLOW
- No caching whatsoever → SLOW
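A minimal sketch of how this breakdown is measured; `compute_features_for_inference` is the SQL path above, and `model` stands in for whatever trained sklearn-style model `train_naive.py` produced:

```python
import time

start = time.perf_counter()
feature_vector = compute_features_for_inference(transaction_id, conn)  # SQL above
feature_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
prediction = model.predict([feature_vector])  # hypothetical trained model
model_ms = (time.perf_counter() - start) * 1000

total_ms = feature_ms + model_ms
print(f"Feature computation: {feature_ms:7.2f}ms ({feature_ms / total_ms:5.1%})")
print(f"Model inference:     {model_ms:7.2f}ms ({model_ms / total_ms:5.1%})")
print(f"Total:               {total_ms:7.2f}ms")
```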
Example output:

```
Feature computation: 847.23ms (97.8%)
Model inference:      18.45ms ( 2.2%)
Total:               865.68ms
```

⚠️ 97.8% of time spent recomputing features!
Scaling problems:

- 20 predictions = 17 seconds
- 1,000 predictions ≈ 14 minutes
- 100,000 predictions ≈ 24 hours
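The extrapolation is just the measured per-prediction latency scaled linearly:

```python
per_prediction_s = 17 / 20  # 0.85s, measured on the 20-prediction batch

print(f"{1_000 * per_prediction_s / 60:.0f} minutes for 1,000 predictions")      # ~14 minutes
print(f"{100_000 * per_prediction_s / 3600:.1f} hours for 100,000 predictions")  # ~23.6 hours
```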
This is NOT suitable for production!
In the naive approach, you could accidentally use future data:

```python
# WRONG: this includes transactions AFTER the one we're predicting!
card_stats = df.groupby('card1')['is_fraud'].mean()
```

This causes data leakage and artificially inflates model performance.
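Avoiding the leak by hand means restricting every aggregate to transactions that happened strictly before the one being scored. A minimal sketch, assuming the same `df` plus a `transaction_dt` timestamp column:

```python
# Sort chronologically so "earlier rows" means "earlier transactions"
df = df.sort_values('transaction_dt')

# For each transaction, the fraud rate over the card's PRIOR transactions only:
# shift(1) drops the current row, expanding().mean() averages everything before it
df['card_fraud_rate'] = (
    df.groupby('card1')['is_fraud']
      .transform(lambda s: s.shift(1).expanding().mean())
)
# The first transaction per card has no history -> NaN, which also must be handled
```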
When you update features:
- Old models break
- No rollback capability
- Hard to A/B test features
- Can't compare model versions fairly
Every inference runs expensive aggregate queries:
- Database CPU spikes
- Other queries get slow
- Not scalable
- Production DB at risk
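What a feature store automates is exactly the pre-computation that takes this load off the database: run the aggregation once in batch, push the results to a fast key-value store, and serve lookups from there. A minimal sketch of that idea, assuming a local Redis, the transactions table above, and a hypothetical Postgres DSN:

```python
import json
import psycopg2
import redis

conn = psycopg2.connect("dbname=fraud user=postgres")  # hypothetical DSN
r = redis.Redis()

# 1. Materialize: run the expensive aggregation ONCE, not per prediction
with conn.cursor() as cur:
    cur.execute("""
        SELECT card1,
               COUNT(*)             AS card_transaction_count,
               AVG(transaction_amt) AS card_avg_amt,
               AVG(is_fraud)        AS card_fraud_rate
        FROM transactions
        GROUP BY card1
    """)
    for card1, count, avg_amt, fraud_rate in cur:
        r.set(f"card_features:{card1}",
              json.dumps({"count": count,
                          "avg_amt": float(avg_amt),
                          "fraud_rate": float(fraud_rate)}))

# 2. Serve: each prediction is now a single O(1) key lookup
features = json.loads(r.get("card_features:1001"))  # hypothetical card id
```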
Naive approach:

| Metric | Value |
|---|---|
| Single prediction latency | 500-1000ms |
| Throughput | 1-2 predictions/sec |
| Database load | Very high |
| Feature computation | 95%+ of time |
| Scalability | Poor |
With Featureform:

| Metric | Value |
|---|---|
| Single prediction latency | 10-50ms |
| Throughput | 20-100 predictions/sec |
| Database load | Low (pre-computed) |
| Feature computation | <10% of time |
| Scalability | Excellent |
Result: 10-100x faster with Featureform!
```python
# Features are materialized once
# Served from Redis (milliseconds)
features = client.features(...).serve(entity_id)
```

```python
# Define feature ONCE
@ff.entity
class Transaction:
    card_avg_amt = ff.Feature(
        transformation[["entity_id", "card_avg_amt"]],
        variant="v1"
    )
```

```python
# Training uses v1
training_set = client.training_set("fraud_detection", "v1")

# Deploy v2 without breaking v1
@ff.feature(variant="v2")
def card_avg_amt_v2():
    ...
```

Featureform automatically handles temporal correctness, so no data leakage is possible. You also get lineage and monitoring built in:

- Track feature usage across models
- Monitor feature drift
- Understand dependencies
- Debug issues quickly
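For reference, a rough sketch of what serving and training reads look like with Featureform's Python client; the feature names, variants, entity key, and host here are illustrative, assuming features have been registered and materialized as above:

```python
import featureform as ff

# Serving: features come from the inference store (e.g., Redis), not the warehouse
serving = ff.ServingClient(host="localhost:7878", insecure=True)
features = serving.features(
    [("card_avg_amt", "v1"), ("card_fraud_rate", "v1")],  # illustrative names
    {"transaction": "2987000"},
)

# Training: the same definitions produce a point-in-time correct training set
dataset = serving.training_set("fraud_detection", "v1")
for row in dataset:
    print(row.features(), row.label())
    break
```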
To see the difference yourself:

```bash
cd naive/
python inference_naive.py --batch 20
# Watch: ~17 seconds for 20 predictions
```

```bash
cd ../  # Go to the Featureform implementation
python inference.py --batch 20
# Watch: ~1 second for 20 predictions
```

Without Featureform:
- 800ms per prediction
- Can handle ~1,200 transactions/hour per server
- Need 8 servers for 10,000 transactions/hour
- Cost: ~$3,200/month (8 × $400)
- Risk: Database overload during peak times
With Featureform:
- 20ms per prediction
- Can handle ~180,000 transactions/hour per server
- Need 1 server for 10,000 transactions/hour
- Cost: ~$400/month
- Risk: Minimal, scales easily
Savings: $2,800/month, plus fewer operational headaches.
Without a feature store:

- ✗ Slow predictions (100ms-1000ms)
- ✗ Duplicated feature logic
- ✗ Risk of training-serving skew
- ✗ Database becomes bottleneck
- ✗ Hard to maintain
- ✗ Not scalable
- ✗ No versioning
- ✗ Manual point-in-time handling

With Featureform:

- ✓ Fast predictions (10-50ms)
- ✓ Single feature definition
- ✓ Guaranteed consistency
- ✓ Database barely touched
- ✓ Easy to maintain
- ✓ Horizontally scalable
- ✓ Automatic versioning
- ✓ Point-in-time correctness built-in
- Run the naive approach to see the problems firsthand
- Compare with Featureform implementation in parent directory
- Measure the difference in latency and throughput
- Calculate the ROI for your use case
This naive implementation intentionally shows bad practices to demonstrate why feature stores like Featureform are essential for production ML systems.
For the proper Featureform implementation, see the parent directory.