Add SPANN (Disk Resident HNSW-IVF) Implementation #15613
Conversation
Signed-off-by: Atri Sharma <[email protected]>
vigyasharma left a comment:
Left some initial thoughts. Curious what recall/latency performance you've seen with these changes. Please share some benchmarks when you have them.
```java
for (int i = 0; i < target.length; i++) {
  floatTarget[i] = (float) target[i];
}
```
Do we need an `i += 4`, i.e. a jump of 4 bytes, when converting to floats? I've seen us use ByteBuffers in Lucene for these things.
The original loop was casting 8-bit integer dimensions to floats, not reinterpreting a byte stream. However, to eliminate this ambiguity entirely and align with other formats, I’ve removed the mixed-type support.
search(byte[]) now strictly throws if run against a FLOAT32 index.
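A minimal sketch of the strict encoding check described above. The enum, helper name, and message text here are stand-ins for the PR's actual code, which works against Lucene's real `VectorEncoding` and `FieldInfo`:

```java
import java.util.Locale;

// Stand-in for org.apache.lucene.index.VectorEncoding, redefined so the sketch
// is self-contained.
enum VectorEncoding {
  BYTE,
  FLOAT32
}

final class EncodingGuard {
  /**
   * Mirrors the strict type matching: a byte[] query against a FLOAT32 field
   * fails fast instead of being implicitly converted.
   */
  static void checkByteQuery(VectorEncoding fieldEncoding, String fieldName) {
    if (fieldEncoding != VectorEncoding.BYTE) {
      throw new IllegalArgumentException(
          String.format(
              Locale.ROOT,
              "field \"%s\" is encoded as %s; byte[] queries require BYTE encoding",
              fieldName,
              fieldEncoding));
    }
  }
}
```

A symmetric check (float[] query against a BYTE field) would complete the "vice versa" half mentioned later in the thread.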
```java
TopDocs topCentroids;
if (entry.fieldInfo.getVectorEncoding() == VectorEncoding.BYTE) {
  byte[] byteTarget = new byte[target.length];
```
I'm confused on the need for handling these cases - float target with byte vectors. Are you always storing the postings in byte format? (but I see a vice versa conversion as well). From what I know, the only place we have this currently is when vectors are explicitly stored as bytes, e.g. with scalar quantized vectors. In which case, the target is converted to bytes when it is quantized.
On the search side, the code enforces strict type matching: search() explicitly checks the field encoding and throws IllegalArgumentException if you try to query a FLOAT32 field with bytes, or vice versa.
searchFine accepts both arguments solely to deduplicate the traversal implementation; it strictly relies on the segment's encoding to decide which target to use. There is no implicit conversion or quantization.
On the writer side, we lift byte vectors to floats in memory to run K-Means since centroid averaging requires continuous space, then cast back to bytes for the on-disk posting lists. The reader strictness matches the final on-disk format.
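The writer-side round trip described above can be sketched like this. The class and method names are illustrative, not the PR's; the point is that widening is a per-dimension cast, not a byte-stream reinterpretation:

```java
final class ByteFloatRoundTrip {
  /** Widen signed byte dimensions to floats so K-Means can average centroids. */
  static float[] toFloats(byte[] v) {
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = v[i]; // widening cast per dimension, e.g. (byte) -3 -> -3.0f
    }
    return out;
  }

  /** Narrow back to bytes for the on-disk posting lists. */
  static byte[] toBytes(float[] v) {
    byte[] out = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (byte) Math.round(v[i]); // rounding choice is an assumption here
    }
    return out;
  }
}
```

Since the reader's strictness matches the final on-disk format, the round trip only happens during flush; search never converts.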
```java
/**
 * Buffers vectors in memory until flush.
 *
 * <p>Future improvements could include off-heap or disk-backed buffering (e.g. ByteBlockPool) to
```
Minor: How do you feel about keeping these ideas in github issues instead of code comments? IMO it's okay to call out a performance bottleneck TODO in comments, but naming specific techniques leaves future readers wondering whether and where they were implemented.
No strong preference here. Will remove them.
```java
    AcceptDocs acceptDocs)
    throws IOException {
  IndexInput dataIn = entry.dataIn.clone();
  for (ScoreDoc centroidDoc : topCentroids.scoreDocs) {
```
Can we avoid looking at all the topK centroid postings, and short-circuit if dist(centroid, target) > nearest_centroid_distance + small_threshold ?
Great catch. I've implemented dynamic pruning in searchFine: it now checks if (bestCentroidScore - currentScore > DYNAMIC_PRUNING_THRESHOLD) and skips the entire partition if it's significantly worse than the best candidate.
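The pruning rule reduces to a one-line predicate. This sketch assumes a "higher score is better" convention and an arbitrary threshold value; the PR's actual constant and calling context may differ:

```java
final class PartitionPruner {
  // Illustrative value; the PR defines its own DYNAMIC_PRUNING_THRESHOLD.
  static final float DYNAMIC_PRUNING_THRESHOLD = 0.1f;

  /**
   * Returns true if a partition whose centroid scored currentScore is far
   * enough behind the best centroid that scanning its postings can be skipped.
   */
  static boolean shouldSkip(float bestCentroidScore, float currentScore) {
    return bestCentroidScore - currentScore > DYNAMIC_PRUNING_THRESHOLD;
  }
}
```

Because centroids arrive ranked from phase one, once one partition is skipped all later (worse-scoring) partitions can be skipped too, which is what makes this a short-circuit rather than just a filter.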
```java
// Cap the number of partitions at configured limit (default 100)
int numPartitions = Math.min(vectorArray.length, maxPartitions);

// Downsample to keep flush time constant
```
Minor: Let's treat this downsampling as a "clustering impl" detail and move it into the clustering class?
Refactored the reservoir sampling logic out of the Writer and into SpannKMeans#downsample.
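For reference, a textbook single-pass reservoir sample of the kind `SpannKMeans#downsample` could use. Names and the seeding strategy are illustrative, not the PR's:

```java
import java.util.Random;

final class Downsampler {
  /**
   * Uniformly samples up to sampleSize vectors from the input in one pass.
   * Each input vector ends up in the reservoir with equal probability.
   */
  static float[][] downsample(float[][] vectors, int sampleSize, long seed) {
    if (vectors.length <= sampleSize) {
      return vectors; // nothing to downsample
    }
    Random random = new Random(seed);
    float[][] reservoir = new float[sampleSize][];
    // Fill the reservoir with the first sampleSize vectors.
    for (int i = 0; i < sampleSize; i++) {
      reservoir[i] = vectors[i];
    }
    // Each later vector replaces a random slot with probability sampleSize / (i + 1).
    for (int i = sampleSize; i < vectors.length; i++) {
      int j = random.nextInt(i + 1);
      if (j < sampleSize) {
        reservoir[j] = vectors[i];
      }
    }
    return reservoir;
  }
}
```

This is what keeps K-Means training time roughly constant per flush: the clustering cost depends on the fixed sample size, not on how many vectors were buffered.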
```java
float bestScore = Float.NEGATIVE_INFINITY;

for (int c = 0; c < centroids.length; c++) {
  float score = simFunc.compare(vector, centroids[c]);
```
k-means clustering would have already assigned some vectors to partitions. Do we need to compare every vector against every centroid?
Since we only train on a sample of the flush (e.g. 16k vectors), treating the final assignment as a fresh pass over all vectors is simpler and safer than tracking partial pre-assignments. I prefer this simplicity for the initial version but will defer to your advice.
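The "fresh pass" is just an exhaustive nearest-centroid assignment. A sketch of its shape, using dot product as a stand-in similarity (the PR uses the field's configured VectorSimilarityFunction), with all names hypothetical:

```java
final class Assigner {
  /** Assigns each vector to its best-scoring centroid in one O(n * k) pass. */
  static int[] assign(float[][] vectors, float[][] centroids) {
    int[] assignments = new int[vectors.length];
    for (int v = 0; v < vectors.length; v++) {
      int best = -1;
      float bestScore = Float.NEGATIVE_INFINITY;
      for (int c = 0; c < centroids.length; c++) {
        float score = dot(vectors[v], centroids[c]);
        if (score > bestScore) {
          bestScore = score;
          best = c;
        }
      }
      assignments[v] = best;
    }
    return assignments;
  }

  // Stand-in for simFunc.compare(...) in the real code.
  static float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

Reusing training-time assignments would only cover the sampled vectors anyway, so the full pass is unavoidable for the unsampled majority.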
- Skip processing candidate partitions that fall below a score threshold relative to the best candidate, improving search latency.
- Enforce consistency between query and index vector types; removes mixed-mode support to simplify search paths.
- Benchmarks: verified recall and memory usage (hot vs. cold) via JMH. Benchmark output is made verbose to distinguish RSS-driving structures from streaming data.
- Centralize vector downsampling in SpannKMeans.
- Add parity tests checking ANN correctness against brute-force search.

Signed-off-by: Atri Sharma <[email protected]>
+1 -- luceneutil now supports Cohere v3 1024-dimension vectors, ~40 MM vectors, unit-sphere normalized.
Add tests and JMH open-reader benchmark. Signed-off-by: Atri Sharma <[email protected]>
Adds Lucene99SpannVectorsFormat, implementing Disk-Resident HNSW-IVF (SPANN) to support vector indices larger than available heap.
This PR segregates the index into a coarse quantizer (centroids in HNSW) and the actual data (disk-resident inverted lists).
Writer flow: Buffers vectors in heap, then runs K-Means++ on flush (using reservoir sampling for amortised performance). Writes centroids to the delegate format and vector data sequentially to .spad files.
Reader flow: Performs a two-phase search. First phase queries HNSW for the nearest nprobe partitions. Second phase uses the chosen centroids from first phase and scans the candidate partitions on disk.
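The two-phase reader flow above can be sketched end to end in miniature. Here the HNSW coarse quantizer is replaced by a brute-force centroid scan and the disk-resident posting lists by in-memory arrays, so every name and type is a stand-in for the real `Lucene99SpannVectorsFormat` machinery:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TwoPhaseSearch {
  /**
   * Phase 1: find the nprobe nearest centroids (stand-in for the HNSW query).
   * Phase 2: scan only those centroids' partitions and keep the k closest docs.
   */
  static int[] search(
      float[] target, float[][] centroids, int[][] partitions,
      float[][] vectors, int nprobe, int k) {
    // Phase 1: rank centroids by distance to the target.
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    java.util.Arrays.sort(order, Comparator.comparingDouble(c -> dist(target, centroids[c])));

    // Phase 2: gather candidates from the nprobe best partitions, then rank them.
    List<Integer> candidates = new ArrayList<>();
    for (int p = 0; p < Math.min(nprobe, order.length); p++) {
      for (int doc : partitions[order[p]]) {
        candidates.add(doc);
      }
    }
    candidates.sort(Comparator.comparingDouble(doc -> dist(target, vectors[doc])));

    int[] result = new int[Math.min(k, candidates.size())];
    for (int i = 0; i < result.length; i++) {
      result[i] = candidates.get(i);
    }
    return result;
  }

  static double dist(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

The recall/nprobe trade-off discussed in the review falls directly out of phase 1: a larger nprobe scans more partitions, recovering vectors whose nearest centroid was not the target's nearest centroid.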
Testing strategy includes unit tests for integrity, clustering correctness, and a recall validation test confirming that a higher nprobe retrieves better results.
Note: Merging currently uses the default implementation and requires heap proportional to segment size. Disk-based merging is a follow-up.