Skip to content

🧬 Self-optimizing vector database using MAP-Elites. 51x faster than ChromaDB, 82x faster than LanceDB, 100% recall. Auto-evolves optimal index configs via Quality Diversity.

License

Notifications You must be signed in to change notification settings

justrach/emergentDB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

21 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

EmergentDB

Proof of Concept - A self-optimizing vector database that uses MAP-Elites evolutionary algorithm to automatically discover the optimal index configuration for your workload.

51-82x faster than competitors. 100% recall on real embeddings.

EmergentDB Benchmark Results

What are Embeddings?

Embeddings are numerical representations of data (text, images, audio) that capture semantic meaning. Similar concepts have similar embeddings, enabling semantic search - finding results by meaning rather than exact keyword matches.

New to embeddings? Watch this excellent explainer: What are Embeddings?

"How do I fix a flat tire?" β†’ [0.12, -0.34, 0.56, ...]  (768 numbers)
"Changing a punctured wheel"  β†’ [0.11, -0.33, 0.55, ...]  (very similar!)
"Best pizza in NYC"           β†’ [0.89, 0.12, -0.45, ...]  (very different)

EmergentDB stores these vectors and finds the most similar ones at blazing speed.

The Problem

As embedding dimensions grow (768-3072), traditional vector databases struggle:

  • Manual Tuning Hell: HNSW M=16? M=32? ef_construction=100? Most teams guess and hope.
  • Workload Mismatch: Optimal config for 1K vectors β‰  optimal for 100K. Databases don't adapt.
  • Recall vs Speed: Fast search often means lower recall. You shouldn't have to choose.

The Solution

EmergentDB uses a Dual Quality-Diversity System:

  1. IndexQD - 3D behavior space (Recall Γ— Latency Γ— Memory) evolves optimal index type and hyperparameters
  2. InsertQD - 2D behavior space (Throughput Γ— Efficiency) discovers fastest SIMD insertion strategy

The system automatically selects between HNSW, Flat, and IVF indices with evolved hyperparameters, achieving maximum search speed while enforcing a 99% recall floor.

Benchmark Results

768-dimensional Gemini embeddings (real semantic vectors, not random):

10,000 Vectors

Database Search Latency Recall@10 Speedup
EmergentDB (HNSW m=8) 44ΞΌs 100% baseline
EmergentDB (HNSW m=16) 102ΞΌs 100% -
ChromaDB 2,259ΞΌs 99.8% 51x slower
LanceDB 3,590ΞΌs 84.3% 82x slower

EmergentDB: 51x faster than ChromaDB, 82x faster than LanceDB

Why Real Embeddings Matter

Random vectors suffer from the "curse of dimensionality" - all points become equidistant, making ANN algorithms appear broken. Real embeddings from Gemini have semantic structure, allowing HNSW to achieve 100% recall even with aggressive parameters.

See tests/methodology.md for detailed benchmark methodology.

Quick Start

Running the Server

# In-memory mode (default)
cargo run --release -p api-server

# With persistence (vectors survive restart)
DATA_DIR=./data cargo run --release -p api-server

# Custom settings
PORT=8080 VECTOR_DIM=768 DATA_DIR=./mydata cargo run --release -p api-server

Using the Rust Library

use vector_core::index::emergent::{EmergentConfig, EmergentIndex};

// Create with search-optimized preset
let config = EmergentConfig::search_first();
let mut index = EmergentIndex::new(config);

// Insert your vectors
for (id, embedding) in vectors {
    index.insert(id, embedding)?;
}

// Evolve to find optimal configuration
let elite = index.evolve()?;
println!("Selected: {} (fitness: {:.3})",
    elite.genome.index_type, elite.fitness);

// Search - now 44-193x faster
let results = index.search(&query, 10)?;

Configuration Presets

// Maximum search speed (default)
EmergentConfig::search_first()  // 50% recall, 40% speed, 5% memory, 5% build

// Balanced (equal weight to all objectives)
EmergentConfig::balanced()

// Memory-constrained environments
EmergentConfig::memory_efficient()

Precomputed MAP-Elites Grid

EmergentDB includes a pre-evolved grid of industry-standard configurations. Use these for instant optimal performance without evolution time:

use vector_core::{PrecomputedElitesGrid, EmergentIndex, EmergentConfig};

let mut index = EmergentIndex::new(EmergentConfig::fast());

// Insert your vectors
for (id, embedding) in vectors {
    index.insert(id, embedding)?;
}

// Apply best configuration for your scale
let grid = PrecomputedElitesGrid::new();
let elite = grid.recommend(vectors.len(), "balanced");
index.apply_precomputed_elite(elite)?;

// Now search at 44ΞΌs with 100% recall
let results = index.search(&query, 10)?;

Available Configurations

Priority Configuration Parameters Expected Recall Best For
speed Ultra-fast HNSW m=8, ef=50 75-100% Low-latency apps
balanced OpenSearch default m=16, ef=100 92-100% General use
accuracy High-recall HNSW m=24, ef=200 98-100% Precision-critical
max Research-grade m=48, ef=500 99-100% Maximum quality

Configuration Sources

These configurations are derived from production systems:

Source Use Case
OpenSearch k-NN Low/Medium/High configs
Milvus Recommended HNSW params
Pinecone Performance-tuned settings
Research literature Maximum quality benchmarks

Recommendations by Dataset Size

let grid = PrecomputedElitesGrid::new();

// Get all configurations matching your scale
let matching = grid.get_for_scale(10_000);  // Returns 9 configs for 10K vectors

// Get specific recommendation
let speed_config = grid.recommend(10_000, "speed");      // m=8, ef=50
let balanced = grid.recommend(10_000, "balanced");       // m=16, ef=100
let accuracy = grid.recommend(10_000, "accuracy");       // m=24, ef=200

Index Types

Index Complexity Best For
Flat O(N) Small datasets, exact search baseline
HNSW O(log N) High recall requirements
IVF O(N/partitions) Large datasets with clustering
PQ O(N) compressed Memory-constrained environments
Emergent Adaptive Automatic optimization

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        EmergentDB                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚   IndexQD    β”‚    β”‚   InsertQD   β”‚    β”‚   Archive    β”‚   β”‚
β”‚  β”‚  (3D Grid)   β”‚    β”‚  (2D Grid)   β”‚    β”‚  (Elites)    β”‚   β”‚
β”‚  β”‚              β”‚    β”‚              β”‚    β”‚              β”‚   β”‚
β”‚  β”‚ Recall       β”‚    β”‚ Throughput   β”‚    β”‚ Best configs β”‚   β”‚
β”‚  β”‚ Latency      β”‚    β”‚ Efficiency   β”‚    β”‚ per cell     β”‚   β”‚
β”‚  β”‚ Memory       β”‚    β”‚              β”‚    β”‚              β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚           β”‚                  β”‚                   β”‚           β”‚
β”‚           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚                              β”‚                               β”‚
β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚                    β”‚  MAP-Elites     β”‚                       β”‚
β”‚                    β”‚  Evolution      β”‚                       β”‚
β”‚                    β”‚  (6Β³ = 216      β”‚                       β”‚
β”‚                    β”‚   cells)        β”‚                       β”‚
β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β”‚                              β”‚                               β”‚
β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚                    β”‚  Optimal Index  β”‚                       β”‚
β”‚                    β”‚  Selection      β”‚                       β”‚
β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       β”‚
β”‚                              β”‚                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚     HNSW     β”‚  β”‚     Flat     β”‚  β”‚     IVF      β”‚       β”‚
β”‚  β”‚  M, ef_c,    β”‚  β”‚   Brute      β”‚  β”‚  nlist,      β”‚       β”‚
β”‚  β”‚  ef_search   β”‚  β”‚   Force      β”‚  β”‚  nprobe      β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Persistence

EmergentDB supports durable storage via RocksDB. When DATA_DIR is set:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    PERSISTENCE MODE                         β”‚
β”‚                                                             β”‚
β”‚   Search: RAM ──SIMD──→ Results   (42ΞΌs, unchanged!)       β”‚
β”‚           ↑                                                 β”‚
β”‚           └── Loaded from disk on startup                   β”‚
β”‚                                                             β”‚
β”‚   Insert: RAM + async write β†’ Disk  (non-blocking)         β”‚
β”‚                                                             β”‚
β”‚   Restart: Automatic recovery from RocksDB                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key benefits:

  • Vectors survive server restarts
  • No impact on search performance (still in-memory SIMD)
  • Automatic recovery on startup
  • LZ4 compression for efficient storage

Key Features

99% Recall Floor

Configurations with <99% recall get cubic fitness penalty, ensuring accuracy is never sacrificed for speed.

SIMD-Optimized Insertions

Six insertion strategies compete in InsertQD:

  • SimdSequential
  • SimdBatch
  • SimdParallel
  • SimdChunked (L2 cache-friendly)
  • SimdUnrolled (4-way loop unrolling)
  • SimdInterleaved (two-pass for memory bandwidth)

Best strategy automatically selected: 5.6M vectors/second on modern CPUs.

Adaptive Index Selection

EmergentDB automatically selects:

  • HNSW: For larger datasets (>5K vectors) - evolved M, ef_construction, ef_search
  • Flat: For small datasets or when recall is paramount
  • IVF: For very large datasets with memory constraints

Running the Frontend

The visualization dashboard shows real-time benchmark comparisons with Airfoil design:

cd frontend
bun install
bun run dev

Open http://localhost:3000 to see the interactive benchmark visualization.

Running Benchmarks

Gemini Embedding Benchmark (Recommended)

# Generate embeddings (requires GEMINI_API_KEY)
cd tests
export GEMINI_API_KEY="your-key"
python3 gemini_embedding_benchmark.py
python3 scale_gemini_embeddings.py

# Run Rust benchmark with real embeddings
cargo run --release --example gemini_benchmark -p vector-core -- 10000

Full Comparison Benchmark

cd tests
python3 full_comparison_benchmark.py  # Compares EmergentDB vs LanceDB vs ChromaDB

Scale Benchmark (Rust)

cargo run --release --example scale_benchmark

Project Structure

emergentdb/
β”œβ”€β”€ crates/
β”‚   β”œβ”€β”€ vector-core/           # Core vector index library
β”‚   β”‚   └── src/
β”‚   β”‚       β”œβ”€β”€ index/
β”‚   β”‚       β”‚   β”œβ”€β”€ emergent.rs    # MAP-Elites + PrecomputedElitesGrid
β”‚   β”‚       β”‚   β”œβ”€β”€ hnsw.rs        # HNSW index
β”‚   β”‚       β”‚   β”œβ”€β”€ flat.rs        # Flat index
β”‚   β”‚       β”‚   └── ivf.rs         # IVF index
β”‚   β”‚       └── simd.rs            # SIMD insert strategies
β”‚   └── api-server/            # REST API server
β”œβ”€β”€ frontend/                  # Next.js visualization dashboard
β”œβ”€β”€ public/                    # Static assets (benchmark graphics)
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ gemini_benchmark.rs    # Benchmark with real embeddings
β”‚   └── index_benchmark.rs     # Precomputed grid testing
└── tests/
    β”œβ”€β”€ methodology.md         # Benchmark methodology documentation
    β”œβ”€β”€ gemini_embedding_benchmark.py
    β”œβ”€β”€ scale_gemini_embeddings.py
    β”œβ”€β”€ full_comparison_benchmark.py
    └── benchmark_results/

SIMD Optimization

EmergentDB uses platform-specific SIMD for maximum performance:

  • Apple Silicon (M1-M4): ARM NEON with fused multiply-accumulate
  • x86_64: Wide SIMD via the wide crate
  • Fallback: Scalar operations

Build for your platform:

# Apple M-series
RUSTFLAGS="-C target-cpu=apple-m1" cargo build --release

# Generic release
cargo build --release

Why "Emergent"?

The optimal configuration emerges from evolutionary pressure. Instead of hand-tuning hyperparameters, EmergentDB:

  1. Generates diverse configurations
  2. Evaluates fitness on your actual data
  3. Keeps the best per behavior cell (MAP-Elites)
  4. Crosses over successful genomes
  5. Mutates to explore new configurations
  6. Repeats until convergence

The result: a configuration perfectly adapted to your specific workload, data distribution, and hardware.

Configuration Options

EmergentConfig

Field Default Description
dim 1536 Vector dimensionality
metric Cosine Distance metric
grid_size 6 MAP-Elites grid resolution (6Β³ = 216 cells)
generations 10 Evolution iterations
population_size 10 Candidates per generation
eval_sample_size 1000 Vectors for benchmarking
benchmark_queries 100 Queries for recall measurement

Index-Specific Configs

HNSW:

  • m: Neighbors per node (8-48)
  • ef_construction: Build-time candidates (100-400)
  • ef_search: Search-time candidates (20-200)

IVF:

  • num_partitions: Cluster count (64-1024)
  • nprobe: Partitions to search (4-64)
  • kmeans_iterations: Training iterations

Testing

# Run all tests
cargo test --workspace

# Run benchmarks
cargo bench

# Check compilation
cargo check --workspace

License

AGPL-3.0 - See LICENSE for details.

References

About

🧬 Self-optimizing vector database using MAP-Elites. 51x faster than ChromaDB, 82x faster than LanceDB, 100% recall. Auto-evolves optimal index configs via Quality Diversity.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published