Proof of Concept - A self-optimizing vector database that uses MAP-Elites evolutionary algorithm to automatically discover the optimal index configuration for your workload.
51-82x faster than ChromaDB and LanceDB. 100% recall on real embeddings.
Embeddings are numerical representations of data (text, images, audio) that capture semantic meaning. Similar concepts have similar embeddings, enabling semantic search - finding results by meaning rather than exact keyword matches.
New to embeddings? Watch this excellent explainer: What are Embeddings?
"How do I fix a flat tire?" β [0.12, -0.34, 0.56, ...] (768 numbers)
"Changing a punctured wheel" β [0.11, -0.33, 0.55, ...] (very similar!)
"Best pizza in NYC" β [0.89, 0.12, -0.45, ...] (very different)
EmergentDB stores these vectors and finds the most similar ones at blazing speed.
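Under the hood, "most similar" usually means cosine similarity over the raw vectors. A minimal, dependency-free sketch of the idea (not EmergentDB's SIMD-accelerated path):

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns a value in [-1, 1]; higher means more semantically similar.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

fn main() {
    // Truncated stand-ins for the 768-dim embeddings above.
    let flat_tire = [0.12, -0.34, 0.56];
    let punctured = [0.11, -0.33, 0.55];
    let pizza = [0.89, 0.12, -0.45];
    println!("{:.3}", cosine_similarity(&flat_tire, &punctured)); // high
    println!("{:.3}", cosine_similarity(&flat_tire, &pizza));     // low
}
```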
As embedding dimensions grow (768-3072), traditional vector databases struggle:
- Manual Tuning Hell: HNSW M=16? M=32? ef_construction=100? Most teams guess and hope.
- Workload Mismatch: Optimal config for 1K vectors ≠ optimal for 100K. Databases don't adapt.
- Recall vs Speed: Fast search often means lower recall. You shouldn't have to choose.
EmergentDB uses a Dual Quality-Diversity System:
- IndexQD - 3D behavior space (Recall × Latency × Memory) evolves optimal index type and hyperparameters
- InsertQD - 2D behavior space (Throughput × Efficiency) discovers fastest SIMD insertion strategy
The system automatically selects between HNSW, Flat, and IVF indices with evolved hyperparameters, achieving maximum search speed while enforcing a 99% recall floor.
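To make the behavior space concrete, here is a hedged sketch of how a measured (recall, latency, memory) triple could map to one of the 6³ archive cells. The names (`BehaviorDescriptor`, `cell_index`) and the latency/memory ranges are illustrative, not EmergentDB's actual API:

```rust
/// Illustrative 3D behavior descriptor for IndexQD (not the actual API).
struct BehaviorDescriptor {
    recall: f32,     // 0.0..=1.0
    latency_us: f32, // measured search latency
    memory_mb: f32,  // index footprint
}

const GRID_SIZE: usize = 6; // 6^3 = 216 cells

/// Map a continuous value in [lo, hi] to one of GRID_SIZE bins.
fn bin(value: f32, lo: f32, hi: f32) -> usize {
    let t = ((value - lo) / (hi - lo)).clamp(0.0, 1.0);
    ((t * GRID_SIZE as f32) as usize).min(GRID_SIZE - 1)
}

/// Flatten the 3D bin coordinates into a single archive index.
fn cell_index(b: &BehaviorDescriptor) -> usize {
    let r = bin(b.recall, 0.0, 1.0);
    let l = bin(b.latency_us, 0.0, 5000.0); // assumed latency range
    let m = bin(b.memory_mb, 0.0, 1024.0);  // assumed memory range
    (r * GRID_SIZE + l) * GRID_SIZE + m
}
```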
768-dimensional Gemini embeddings (real semantic vectors, not random):
| Database | Search Latency | Recall@10 | Speedup |
|---|---|---|---|
| EmergentDB (HNSW m=8) | 44μs | 100% | baseline |
| EmergentDB (HNSW m=16) | 102μs | 100% | - |
| ChromaDB | 2,259μs | 99.8% | 51x slower |
| LanceDB | 3,590μs | 84.3% | 82x slower |
EmergentDB: 51x faster than ChromaDB, 82x faster than LanceDB
Random vectors suffer from the "curse of dimensionality" - all points become equidistant, making ANN algorithms appear broken. Real embeddings from Gemini have semantic structure, allowing HNSW to achieve 100% recall even with aggressive parameters.
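The concentration effect is easy to reproduce. This standalone demo (plain Rust using the `rand` crate, separate from EmergentDB's code) shows the relative spread of pairwise distances collapsing as dimensionality grows:

```rust
use rand::Rng; // assumes the `rand` crate (0.8 API)

fn main() {
    let mut rng = rand::thread_rng();
    for dim in [2usize, 32, 768] {
        // 200 uniformly random points in [0, 1)^dim.
        let points: Vec<Vec<f32>> = (0..200)
            .map(|_| (0..dim).map(|_| rng.gen::<f32>()).collect())
            .collect();
        // All pairwise Euclidean distances.
        let mut dists = Vec::new();
        for i in 0..points.len() {
            for j in (i + 1)..points.len() {
                let d: f32 = points[i].iter().zip(&points[j])
                    .map(|(a, b)| (a - b).powi(2))
                    .sum::<f32>()
                    .sqrt();
                dists.push(d);
            }
        }
        let mean = dists.iter().sum::<f32>() / dists.len() as f32;
        let var = dists.iter().map(|d| (d - mean).powi(2)).sum::<f32>()
            / dists.len() as f32;
        // Relative spread shrinks as dim grows: distances concentrate.
        println!("dim={dim:4}  mean={mean:.2}  relative spread={:.3}",
            var.sqrt() / mean);
    }
}
```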
See tests/methodology.md for detailed benchmark methodology.
```bash
# In-memory mode (default)
cargo run --release -p api-server

# With persistence (vectors survive restart)
DATA_DIR=./data cargo run --release -p api-server

# Custom settings
PORT=8080 VECTOR_DIM=768 DATA_DIR=./mydata cargo run --release -p api-server
```

```rust
use vector_core::index::emergent::{EmergentConfig, EmergentIndex};

// Create with search-optimized preset
let config = EmergentConfig::search_first();
let mut index = EmergentIndex::new(config);

// Insert your vectors
for (id, embedding) in vectors {
    index.insert(id, embedding)?;
}

// Evolve to find optimal configuration
let elite = index.evolve()?;
println!("Selected: {} (fitness: {:.3})",
    elite.genome.index_type, elite.fitness);

// Search - now 44-193x faster
let results = index.search(&query, 10)?;
```

```rust
// Maximum search speed (default)
EmergentConfig::search_first() // 50% recall, 40% speed, 5% memory, 5% build

// Balanced (equal weight to all objectives)
EmergentConfig::balanced()

// Memory-constrained environments
EmergentConfig::memory_efficient()
```

EmergentDB includes a pre-evolved grid of industry-standard configurations. Use these for instant optimal performance without evolution time:
```rust
use vector_core::{PrecomputedElitesGrid, EmergentIndex, EmergentConfig};

let mut index = EmergentIndex::new(EmergentConfig::fast());

// Insert your vectors
for (id, embedding) in vectors {
    index.insert(id, embedding)?;
}

// Apply best configuration for your scale
let grid = PrecomputedElitesGrid::new();
let elite = grid.recommend(vectors.len(), "balanced");
index.apply_precomputed_elite(elite)?;

// Now search at 44μs with 100% recall
let results = index.search(&query, 10)?;
```

| Priority | Configuration | Parameters | Expected Recall | Best For |
|---|---|---|---|---|
| `speed` | Ultra-fast HNSW | m=8, ef=50 | 75-100% | Low-latency apps |
| `balanced` | OpenSearch default | m=16, ef=100 | 92-100% | General use |
| `accuracy` | High-recall HNSW | m=24, ef=200 | 98-100% | Precision-critical |
| `max` | Research-grade | m=48, ef=500 | 99-100% | Maximum quality |
These configurations are derived from production systems:
| Source | Use Case |
|---|---|
| OpenSearch k-NN | Low/Medium/High configs |
| Milvus | Recommended HNSW params |
| Pinecone | Performance-tuned settings |
| Research literature | Maximum quality benchmarks |
```rust
let grid = PrecomputedElitesGrid::new();

// Get all configurations matching your scale
let matching = grid.get_for_scale(10_000); // Returns 9 configs for 10K vectors

// Get specific recommendation
let speed_config = grid.recommend(10_000, "speed");    // m=8, ef=50
let balanced = grid.recommend(10_000, "balanced");     // m=16, ef=100
let accuracy = grid.recommend(10_000, "accuracy");     // m=24, ef=200
```

| Index | Complexity | Best For |
|---|---|---|
| Flat | O(N) | Small datasets, exact search baseline |
| HNSW | O(log N) | High recall requirements |
| IVF | O(N/partitions) | Large datasets with clustering |
| PQ | O(N) compressed | Memory-constrained environments |
| Emergent | Adaptive | Automatic optimization |
```
┌───────────────────────────────────────────────────────────────┐
│                          EmergentDB                           │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐      │
│   │   IndexQD    │   │   InsertQD   │   │   Archive    │      │
│   │  (3D Grid)   │   │  (2D Grid)   │   │   (Elites)   │      │
│   │              │   │              │   │              │      │
│   │  Recall      │   │  Throughput  │   │ Best configs │      │
│   │  Latency     │   │  Efficiency  │   │  per cell    │      │
│   │  Memory      │   │              │   │              │      │
│   └──────────────┘   └──────────────┘   └──────────────┘      │
│          │                  │                  │              │
│          └──────────────────┴──────────────────┘              │
│                             │                                 │
│                   ┌───────────────────┐                       │
│                   │    MAP-Elites     │                       │
│                   │    Evolution      │                       │
│                   │    (6³ = 216      │                       │
│                   │     cells)        │                       │
│                   └───────────────────┘                       │
│                             │                                 │
│                   ┌───────────────────┐                       │
│                   │   Optimal Index   │                       │
│                   │    Selection      │                       │
│                   └───────────────────┘                       │
│                             │                                 │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐      │
│   │     HNSW     │   │     Flat     │   │     IVF      │      │
│   │   M, ef_c,   │   │    Brute     │   │    nlist,    │      │
│   │  ef_search   │   │    Force     │   │    nprobe    │      │
│   └──────────────┘   └──────────────┘   └──────────────┘      │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
EmergentDB supports durable storage via RocksDB. When DATA_DIR is set:
```
┌───────────────────────────────────────────────────────────────┐
│                       PERSISTENCE MODE                        │
│                                                               │
│   Search:  RAM ──SIMD──► Results   (42μs, unchanged!)         │
│             ▲                                                 │
│             └── Loaded from disk on startup                   │
│                                                               │
│   Insert:  RAM + async write ──► Disk   (non-blocking)        │
│                                                               │
│   Restart: Automatic recovery from RocksDB                    │
└───────────────────────────────────────────────────────────────┘
```
Key benefits:
- Vectors survive server restarts
- No impact on search performance (still in-memory SIMD)
- Automatic recovery on startup
- LZ4 compression for efficient storage
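The non-blocking insert path boils down to a write-behind pattern: update RAM immediately, then hand the vector to a background writer so disk I/O never sits on the search path. A minimal illustration with a channel and a thread (illustrative only; the real persistence layer writes to RocksDB):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<(u64, Vec<f32>)>();

    // Background writer: in EmergentDB this role is played by RocksDB writes.
    let writer = thread::spawn(move || {
        for (id, vector) in rx {
            // Stand-in for a durable write (e.g., a RocksDB put).
            println!("persisted vector {} ({} dims)", id, vector.len());
        }
    });

    // Insert path: update RAM, enqueue the disk write, and return.
    let mut in_memory: Vec<(u64, Vec<f32>)> = Vec::new();
    for id in 0..3u64 {
        let v = vec![0.1_f32; 768];
        in_memory.push((id, v.clone())); // searchable immediately
        tx.send((id, v)).unwrap();       // persisted asynchronously
    }
    println!("{} vectors searchable in RAM", in_memory.len());

    drop(tx); // close the channel so the writer thread exits
    writer.join().unwrap();
}
```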
Configurations with <99% recall receive a cubic fitness penalty, ensuring accuracy is never sacrificed for speed.
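One way to realize such a floor (a hedged sketch; the exact penalty shape in EmergentDB may differ):

```rust
const RECALL_FLOOR: f32 = 0.99;

/// Apply a cubic penalty to configurations below the recall floor.
/// The ratio (recall / floor) is < 1 below the floor, so cubing it
/// shrinks fitness rapidly the further recall falls short.
fn penalized_fitness(raw_fitness: f32, recall: f32) -> f32 {
    if recall < RECALL_FLOOR {
        raw_fitness * (recall / RECALL_FLOOR).powi(3)
    } else {
        raw_fitness
    }
}
```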
Six insertion strategies compete in InsertQD:
- SimdSequential
- SimdBatch
- SimdParallel
- SimdChunked (L2 cache-friendly)
- SimdUnrolled (4-way loop unrolling)
- SimdInterleaved (two-pass for memory bandwidth)
The best strategy is selected automatically, reaching 5.6M vectors/second on modern CPUs.
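For a flavor of what these strategies vary, here is a sketch of 4-way loop unrolling for a dot product, the idea behind `SimdUnrolled` (illustrative scalar code; the real strategies use platform SIMD intrinsics):

```rust
/// 4-way unrolled dot product: four independent accumulators let the CPU
/// keep multiple arithmetic pipelines busy instead of serializing on one sum.
fn dot_unrolled4(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunks = n / 4;
    let (mut s0, mut s1, mut s2, mut s3) = (0.0f32, 0.0, 0.0, 0.0);
    for i in 0..chunks {
        let k = i * 4;
        s0 += a[k] * b[k];
        s1 += a[k + 1] * b[k + 1];
        s2 += a[k + 2] * b[k + 2];
        s3 += a[k + 3] * b[k + 3];
    }
    // Handle the tail that doesn't fill a group of four.
    let mut tail = 0.0f32;
    for k in (chunks * 4)..n {
        tail += a[k] * b[k];
    }
    s0 + s1 + s2 + s3 + tail
}
```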
EmergentDB automatically selects (see the sketch after this list):
- HNSW: For larger datasets (>5K vectors) - evolved M, ef_construction, ef_search
- Flat: For small datasets or when recall is paramount
- IVF: For very large datasets with memory constraints
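A hedged sketch of such a size-based heuristic (the thresholds mirror the list above; the concrete parameter values are starting points that evolution then tunes):

```rust
/// Illustrative index-selection heuristic (not EmergentDB's exact logic).
enum IndexChoice {
    Flat, // exact brute-force search
    Hnsw { m: usize, ef_construction: usize, ef_search: usize },
    Ivf { num_partitions: usize, nprobe: usize },
}

fn choose_index(num_vectors: usize, memory_constrained: bool) -> IndexChoice {
    match num_vectors {
        0..=5_000 => IndexChoice::Flat, // small data: exact search is fast
        _ if memory_constrained => IndexChoice::Ivf {
            num_partitions: 256, // assumed defaults within the 64-1024 range
            nprobe: 16,
        },
        _ => IndexChoice::Hnsw {
            m: 16, // evolution refines these starting points
            ef_construction: 100,
            ef_search: 100,
        },
    }
}

fn main() {
    match choose_index(10_000, false) {
        IndexChoice::Flat => println!("Flat"),
        IndexChoice::Hnsw { m, .. } => println!("HNSW with m={m}"),
        IndexChoice::Ivf { num_partitions, .. } => {
            println!("IVF with {num_partitions} partitions")
        }
    }
}
```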
The visualization dashboard shows real-time benchmark comparisons with an Airfoil-inspired design:
```bash
cd frontend
bun install
bun run dev
```

Open http://localhost:3000 to see the interactive benchmark visualization.
```bash
# Generate embeddings (requires GEMINI_API_KEY)
cd tests
export GEMINI_API_KEY="your-key"
python3 gemini_embedding_benchmark.py
python3 scale_gemini_embeddings.py

# Run Rust benchmark with real embeddings
cargo run --release --example gemini_benchmark -p vector-core -- 10000
```

```bash
cd tests
python3 full_comparison_benchmark.py  # Compares EmergentDB vs LanceDB vs ChromaDB
```

```bash
cargo run --release --example scale_benchmark
```

```
emergentdb/
├── crates/
│   ├── vector-core/              # Core vector index library
│   │   └── src/
│   │       ├── index/
│   │       │   ├── emergent.rs   # MAP-Elites + PrecomputedElitesGrid
│   │       │   ├── hnsw.rs       # HNSW index
│   │       │   ├── flat.rs       # Flat index
│   │       │   └── ivf.rs        # IVF index
│   │       └── simd.rs           # SIMD insert strategies
│   └── api-server/               # REST API server
├── frontend/                     # Next.js visualization dashboard
├── public/                       # Static assets (benchmark graphics)
├── examples/
│   ├── gemini_benchmark.rs       # Benchmark with real embeddings
│   └── index_benchmark.rs        # Precomputed grid testing
└── tests/
    ├── methodology.md            # Benchmark methodology documentation
    ├── gemini_embedding_benchmark.py
    ├── scale_gemini_embeddings.py
    ├── full_comparison_benchmark.py
    └── benchmark_results/
```
EmergentDB uses platform-specific SIMD for maximum performance:
- Apple Silicon (M1-M4): ARM NEON with fused multiply-accumulate
- x86_64: Wide SIMD via the `wide` crate
- Fallback: Scalar operations
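In sketch form, the dispatch looks like cfg-gated functions: NEON with fused multiply-accumulate on aarch64, falling back to scalar elsewhere (illustrative; the real strategies live in `crates/vector-core/src/simd.rs`):

```rust
// Hedged sketch of platform dispatch for a dot product.

#[cfg(target_arch = "aarch64")]
fn dot(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::aarch64::*;
    assert_eq!(a.len(), b.len());
    unsafe {
        let mut acc = vdupq_n_f32(0.0);
        let chunks = a.len() / 4;
        for i in 0..chunks {
            let va = vld1q_f32(a.as_ptr().add(i * 4));
            let vb = vld1q_f32(b.as_ptr().add(i * 4));
            acc = vfmaq_f32(acc, va, vb); // fused multiply-accumulate
        }
        let mut sum = vaddvq_f32(acc); // horizontal add of the 4 lanes
        for k in (chunks * 4)..a.len() {
            sum += a[k] * b[k]; // scalar tail
        }
        sum
    }
}

#[cfg(not(target_arch = "aarch64"))]
fn dot(a: &[f32], b: &[f32]) -> f32 {
    // Scalar fallback (on x86_64, EmergentDB uses the `wide` crate instead).
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```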
Build for your platform:
```bash
# Apple M-series
RUSTFLAGS="-C target-cpu=apple-m1" cargo build --release

# Generic release
cargo build --release
```

The optimal configuration emerges from evolutionary pressure. Instead of hand-tuning hyperparameters, EmergentDB:
- Generates diverse configurations
- Evaluates fitness on your actual data
- Keeps the best per behavior cell (MAP-Elites)
- Crosses over successful genomes
- Mutates to explore new configurations
- Repeats until convergence
The result: a configuration perfectly adapted to your specific workload, data distribution, and hardware.
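A self-contained toy version of that loop, with a one-knob genome and crossover omitted for brevity (hedged: EmergentDB's real genomes encode index type plus all hyperparameters):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
struct Genome { ef_search: u32 } // toy: one HNSW-like knob

struct Elite { genome: Genome, fitness: f32 }

/// Toy fitness model: higher ef_search raises recall but lowers speed.
/// Returns (fitness, behavior_cell).
fn evaluate(g: Genome) -> (f32, usize) {
    let recall = 1.0 - 1.0 / (1.0 + g.ef_search as f32 / 50.0);
    let speed = 1.0 / (1.0 + g.ef_search as f32 / 100.0);
    let cell = ((recall * 6.0) as usize).min(5); // 6 bins per dimension
    (0.5 * recall + 0.5 * speed, cell)
}

fn main() {
    let mut archive: HashMap<usize, Elite> = HashMap::new();
    let mut seed = 42u64;
    let mut rand_in = |lo: u32, hi: u32| {
        // tiny xorshift RNG so the sketch needs no external crates
        seed ^= seed << 13; seed ^= seed >> 7; seed ^= seed << 17;
        lo + (seed % (hi - lo) as u64) as u32
    };

    for _generation in 0..10 {      // generations: 10
        for _ in 0..10 {            // population_size: 10
            // Generate a candidate: random at first, mutated elite later.
            let ef_search = match archive.values().next() {
                Some(e) => (e.genome.ef_search + rand_in(0, 40))
                    .saturating_sub(20)
                    .max(20),
                None => rand_in(20, 200),
            };
            let genome = Genome { ef_search };
            let (fitness, cell) = evaluate(genome);
            // Keep only the best genome per behavior cell (MAP-Elites).
            let improved = archive.get(&cell).map_or(true, |e| fitness > e.fitness);
            if improved {
                archive.insert(cell, Elite { genome, fitness });
            }
        }
    }
    for (cell, e) in &archive {
        println!("cell {cell}: ef_search={} fitness={:.3}",
            e.genome.ef_search, e.fitness);
    }
}
```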
| Field | Default | Description |
|---|---|---|
| `dim` | 1536 | Vector dimensionality |
| `metric` | Cosine | Distance metric |
| `grid_size` | 6 | MAP-Elites grid resolution (6³ = 216 cells) |
| `generations` | 10 | Evolution iterations |
| `population_size` | 10 | Candidates per generation |
| `eval_sample_size` | 1000 | Vectors for benchmarking |
| `benchmark_queries` | 100 | Queries for recall measurement |
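If these are public struct fields (as the table suggests, though worth verifying against the crate), a preset can be overridden with struct-update syntax:

```rust
use vector_core::index::emergent::EmergentConfig;

// Hedged: assumes the fields in the table above are public struct fields.
let config = EmergentConfig {
    dim: 768,        // match 768-dim Gemini embeddings (default is 1536)
    generations: 20, // evolve longer for a more refined elite
    ..EmergentConfig::balanced()
};
```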
HNSW:
- `m`: Neighbors per node (8-48)
- `ef_construction`: Build-time candidates (100-400)
- `ef_search`: Search-time candidates (20-200)

IVF:
- `num_partitions`: Cluster count (64-1024)
- `nprobe`: Partitions to search (4-64)
- `kmeans_iterations`: Training iterations
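A hedged sketch of how these evolvable ranges could be encoded, mirroring the bounds above (illustrative; the actual genome lives in `emergent.rs`):

```rust
use std::ops::RangeInclusive;

/// Illustrative evolvable-parameter ranges mirroring the lists above.
struct HnswGenomeSpace {
    m: RangeInclusive<usize>,
    ef_construction: RangeInclusive<usize>,
    ef_search: RangeInclusive<usize>,
}

struct IvfGenomeSpace {
    num_partitions: RangeInclusive<usize>,
    nprobe: RangeInclusive<usize>,
}

fn default_spaces() -> (HnswGenomeSpace, IvfGenomeSpace) {
    (
        HnswGenomeSpace { m: 8..=48, ef_construction: 100..=400, ef_search: 20..=200 },
        IvfGenomeSpace { num_partitions: 64..=1024, nprobe: 4..=64 },
    )
}

/// Clamp a mutated value back into its legal range.
fn clamp_to(range: &RangeInclusive<usize>, value: usize) -> usize {
    value.clamp(*range.start(), *range.end())
}
```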
```bash
# Run all tests
cargo test --workspace

# Run benchmarks
cargo bench

# Check compilation
cargo check --workspace
```

AGPL-3.0 - See LICENSE for details.
- HNSW: "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (Malkov & Yashunin, 2016)
- MAP-Elites: "Illuminating Search Spaces by Mapping Elites" (Mouret & Clune, 2015)
- Product Quantization: "Product Quantization for Nearest Neighbor Search" (Jégou et al., 2011)
- IVF: FAISS
- LanceDB: lancedb.com
- ChromaDB: trychroma.com
