
Conversation

Contributor

@atris atris commented Jan 26, 2026

Adds Lucene99SpannVectorsFormat, implementing Disk-Resident HNSW-IVF (SPANN) to support vector indices larger than available heap.

This PR segregates the index into a coarse quantizer (centroids indexed in HNSW) and the actual data (disk-resident inverted lists).

Writer flow: Buffers vectors in heap, then runs K-Means++ on flush (downsampling via reservoir sampling to keep clustering cost bounded). Writes centroids to the delegate format and vector data sequentially to .spad files.
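A rough sketch of that flush path follows; helper names such as reservoirSample, runKMeansPlusPlus, and writePostingList are illustrative stand-ins, not the actual methods in this PR.

// Vectors buffered in heap since the last flush.
float[][] buffered = bufferedVectors();

// Train on a bounded sample so clustering cost does not grow with flush size.
float[][] sample = reservoirSample(buffered, trainingSampleSize, random);
float[][] centroids = runKMeansPlusPlus(sample, numPartitions);

// Centroids are indexed by the delegate HNSW format; raw vector data is written
// sequentially into per-centroid posting lists in the .spad file.
writeCentroidsToDelegate(centroids);
for (int c = 0; c < centroids.length; c++) {
  writePostingList(c, vectorsAssignedTo(c, buffered, centroids));
}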

Reader flow: Performs a two-phase search. The first phase queries the centroid HNSW for the nearest nprobe partitions; the second phase takes the chosen centroids and scans the corresponding candidate partitions on disk.
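The same flow in sketch form; searchCoarse and scanPartition are illustrative names, not the reader's real internal methods.

// Phase 1: coarse search over the HNSW-indexed centroids (delegate format).
TopDocs topCentroids = searchCoarse(fieldName, target, nprobe);

// Phase 2: scan the disk-resident posting list behind each selected centroid.
List<ScoreDoc> candidates = new ArrayList<>();
for (ScoreDoc centroid : topCentroids.scoreDocs) {
  candidates.addAll(scanPartition(centroid.doc, target, acceptDocs));
}

// Keep the global top-k across all scanned partitions.
candidates.sort((a, b) -> Float.compare(b.score, a.score));
List<ScoreDoc> topK = candidates.subList(0, Math.min(k, candidates.size()));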

Testing strategy includes unit tests for integrity, clustering correctness, and a recall validation test confirming that a higher nprobe improves recall.

Note: Merging currently uses the default implementation and requires heap proportional to segment size. Disk-based merging is a follow-up.

atris added 5 commits January 14, 2026 21:18
Signed-off-by: Atri Sharma <[email protected]>
Signed-off-by: Atri Sharma <[email protected]>
Signed-off-by: Atri Sharma <[email protected]>
Signed-off-by: Atri Sharma <[email protected]>
Contributor

@vigyasharma vigyasharma left a comment


Left some initial thoughts. Curious what recall/latency performance you've seen with these changes. Please share some benchmarks when you have them.

Comment on lines 210 to 212
for (int i = 0; i < target.length; i++) {
floatTarget[i] = (float) target[i];
}
Contributor


Do we need an i += 4, i.e. a jump of 4 bytes, when converting it to floats? I've seen us use ByteBuffers in Lucene for these things.

Contributor Author


The original loop was casting 8-bit integer dimensions to floats, not reinterpreting a byte stream. However, to eliminate this ambiguity entirely and align with other formats, I’ve removed the mixed-type support.
search(byte[]) now strictly throws if run against a FLOAT32 index.
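Roughly the shape of that guard; the signature is an approximation of the reader's search(byte[]) entry point, and the fields/FieldEntry lookup mirrors the entry.fieldInfo access in the surrounding snippets.

@Override
public void search(String field, byte[] target, KnnCollector knnCollector, AcceptDocs acceptDocs)
    throws IOException {
  FieldEntry entry = fields.get(field);
  if (entry.fieldInfo.getVectorEncoding() != VectorEncoding.BYTE) {
    // Mixed-mode queries are no longer supported: byte[] targets only run against BYTE fields.
    throw new IllegalArgumentException(
        "Cannot query field \"" + field + "\" encoded as "
            + entry.fieldInfo.getVectorEncoding() + " with a byte[] target");
  }
  // ... two-phase byte search proceeds from here ...
}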


TopDocs topCentroids;
if (entry.fieldInfo.getVectorEncoding() == VectorEncoding.BYTE) {
byte[] byteTarget = new byte[target.length];
Contributor


I'm confused about the need for handling these cases - a float target with byte vectors. Are you always storing the postings in byte format? (But I see a vice versa conversion as well.) From what I know, the only place we have this currently is when vectors are explicitly stored as bytes, e.g. with scalar quantized vectors. In that case, the target is converted to bytes when it is quantized.

Contributor Author

@atris atris Jan 27, 2026


On the search side, the code enforces strict type matching. search() explicitly checks the field encoding and throws IllegalArgumentException if you try to query a FLOAT32 field with bytes or vice versa.
searchFine accepts both target types solely to deduplicate the traversal implementation; it strictly relies on the segment's encoding to decide which target to use. There is no implicit conversion or quantization.

On the writer side, we lift byte vectors to floats in memory to run K-Means since centroid averaging requires continuous space, then cast back to bytes for the on-disk posting lists. The reader strictness matches the final on-disk format.
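A sketch of that lift-and-cast round trip, using the trained centroids as the example of the cast back; runKMeansPlusPlus and the rounding policy are illustrative.

// Lift byte dimensions to float so centroid averaging runs in continuous space.
float[][] training = new float[byteVectors.length][dim];
for (int v = 0; v < byteVectors.length; v++) {
  for (int d = 0; d < dim; d++) {
    training[v][d] = byteVectors[v][d]; // widening conversion, value-preserving
  }
}
float[][] centroids = runKMeansPlusPlus(training, numPartitions);

// Cast back to bytes so everything written to disk stays in the field's BYTE
// encoding, which the reader's strict type checks expect.
byte[][] byteCentroids = new byte[centroids.length][dim];
for (int c = 0; c < centroids.length; c++) {
  for (int d = 0; d < dim; d++) {
    byteCentroids[c][d] = (byte) Math.round(centroids[c][d]); // clamping omitted for brevity
  }
}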

/**
* Buffers vectors in memory until flush.
*
* <p>Future improvements could include off-heap or disk-backed buffering (e.g. ByteBlockPool) to
Contributor


Minor: How do you feel about keeping these ideas in GitHub issues instead of code comments? IMO it's okay to call out a performance bottleneck TODO in comments, but naming specific techniques leaves future readers wondering whether and where they were implemented.

Contributor Author


No strong preference here. Will remove them.

AcceptDocs acceptDocs)
throws IOException {
IndexInput dataIn = entry.dataIn.clone();
for (ScoreDoc centroidDoc : topCentroids.scoreDocs) {
Contributor


Can we avoid looking at all the topK centroid postings, and short-circuit if dist(centroid, target) > nearest_centroid_distance + small_threshold ?

Contributor Author


Great catch. I've implemented dynamic pruning in searchFine. It now checks if (bestCentroidScore - currentScore > DYNAMIC_PRUNING_THRESHOLD) and skips the entire partition if it's significantly worse than the best candidate.
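Roughly the shape of that check; DYNAMIC_PRUNING_THRESHOLD is named as above, while the surrounding variable and method names are illustrative.

// Centroids arrive ranked best-first from the coarse phase.
float bestCentroidScore = topCentroids.scoreDocs[0].score;
for (ScoreDoc centroidDoc : topCentroids.scoreDocs) {
  // Skip partitions whose centroid scores significantly worse than the best candidate.
  if (bestCentroidScore - centroidDoc.score > DYNAMIC_PRUNING_THRESHOLD) {
    continue; // could also break, since scoreDocs is sorted by descending score
  }
  scanPartition(centroidDoc, target, acceptDocs, knnCollector);
}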

// Cap the number of partitions at configured limit (default 100)
int numPartitions = Math.min(vectorArray.length, maxPartitions);

// Downsample to keep flush time constant
Contributor


Minor: Let's treat this downsampling as a "clustering impl" detail and move it into the clustering class?

Contributor Author


Refactored the reservoir sampling logic out of the Writer and into SpannKMeans#downsample.
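For reference, a standalone sketch of what the extracted helper could look like, assuming a SpannKMeans#downsample(float[][], int, Random) shape (java.util.Random for the source of randomness):

/** Uniformly samples up to sampleSize vectors via reservoir sampling (Algorithm R). */
static float[][] downsample(float[][] vectors, int sampleSize, Random random) {
  if (vectors.length <= sampleSize) {
    return vectors; // small flush: train on everything
  }
  float[][] reservoir = new float[sampleSize][];
  // Seed the reservoir with the first sampleSize vectors.
  System.arraycopy(vectors, 0, reservoir, 0, sampleSize);
  // Every later vector replaces a random slot with probability sampleSize / (i + 1).
  for (int i = sampleSize; i < vectors.length; i++) {
    int j = random.nextInt(i + 1);
    if (j < sampleSize) {
      reservoir[j] = vectors[i];
    }
  }
  return reservoir;
}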

float bestScore = Float.NEGATIVE_INFINITY;

for (int c = 0; c < centroids.length; c++) {
float score = simFunc.compare(vector, centroids[c]);
Contributor


k-means clustering would've already assigned some vectors to partitions. Do we need to compare every vector against every centroid?

Contributor Author


Since we only train on a sample of the flush (e.g. 16k vectors), treating the final assignment as a fresh pass over all vectors is simpler and safer than tracking partial pre-assignments. I prefer this simplicity for the initial version but will defer to your advice.

atris added 3 commits January 27, 2026 22:59
- Skip processing candidate partitions that fall below a score threshold relative to the best candidate, improving search latency.

- Enforce consistency between query and index vector types; removes mixed-mode support to simplify search paths.

- Benchmarks: Verified recall and memory usage (Hot vs Cold) via JMH. Benchmark output is made verbose to distinguish RSS-driving structures from streaming data.

- Centralize vector downsampling in SpannKMeans.

- Add parity tests for ANN correctness against brute-force search.

Signed-off-by: Atri Sharma <[email protected]>
Signed-off-by: Atri Sharma <[email protected]>
@mikemccand
Member

Please share some benchmarks when you have them.

+1 -- luceneutil now supports Cohere v3 1024-dimension vectors, ~40 MM vectors, unit-sphere normalized.

  tests and JMH open-reader benchmark

Signed-off-by: Atri Sharma <[email protected]>
@atris atris closed this Jan 28, 2026
@atris atris reopened this Jan 28, 2026
Signed-off-by: Atri Sharma <[email protected]>