Add SPANN (Disk Resident HNSW-IVF) Implementation #15613
Conversation
Signed-off-by: Atri Sharma <[email protected]>
vigyasharma left a comment:
Left some initial thoughts. Curious what recall/latency performance you've seen with these changes. Please share some benchmarks when you have them.
```java
for (int i = 0; i < target.length; i++) {
  floatTarget[i] = (float) target[i];
}
```
Do we need an `i += 4`, i.e. a jump of 4 bytes, when converting to floats? I've seen us use ByteBuffers in Lucene for these things.
The original loop was casting 8-bit integer dimensions to floats, not reinterpreting a byte stream. However, to eliminate this ambiguity entirely and align with other formats, I’ve removed the mixed-type support.
search(byte[]) now strictly throws if run against a FLOAT32 index.
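A minimal sketch of the strict encoding check described above. The enum, helper name, and message text here are stand-ins for the PR's actual code, which works against Lucene's real `VectorEncoding` and `FieldInfo`:

```java
import java.util.Locale;

// Stand-in for org.apache.lucene.index.VectorEncoding, redefined so the sketch
// is self-contained.
enum VectorEncoding {
  BYTE,
  FLOAT32
}

final class EncodingGuard {
  /**
   * Mirrors the strict type matching: a byte[] query against a FLOAT32 field
   * fails fast instead of being implicitly converted.
   */
  static void checkByteQuery(VectorEncoding fieldEncoding, String fieldName) {
    if (fieldEncoding != VectorEncoding.BYTE) {
      throw new IllegalArgumentException(
          String.format(
              Locale.ROOT,
              "field \"%s\" is encoded as %s; byte[] queries require BYTE encoding",
              fieldName,
              fieldEncoding));
    }
  }
}
```

A symmetric check (float[] query against a BYTE field) would complete the "vice versa" half mentioned later in the thread.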
```java
TopDocs topCentroids;
if (entry.fieldInfo.getVectorEncoding() == VectorEncoding.BYTE) {
  byte[] byteTarget = new byte[target.length];
```
I'm confused on the need for handling these cases - float target with byte vectors. Are you always storing the postings in byte format? (but I see a vice versa conversion as well). From what I know, the only place we have this currently is when vectors are explicitly stored as bytes, e.g. with scalar quantized vectors. In which case, the target is converted to bytes when it is quantized.
On the search side, the code enforces strict type matching: search() explicitly checks the field encoding and throws IllegalArgumentException if you try to query a FLOAT32 field with bytes, or vice versa.
searchFine accepts both arguments solely to deduplicate the traversal implementation; it strictly relies on the segment's encoding to decide which target to use. There is no implicit conversion or quantization.
On the writer side, we lift byte vectors to floats in memory to run K-Means since centroid averaging requires continuous space, then cast back to bytes for the on-disk posting lists. The reader strictness matches the final on-disk format.
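The writer-side round trip described above can be sketched like this. The class and method names are illustrative, not the PR's; the point is that widening is a per-dimension cast, not a byte-stream reinterpretation:

```java
final class ByteFloatRoundTrip {
  /** Widen signed byte dimensions to floats so K-Means can average centroids. */
  static float[] toFloats(byte[] v) {
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = v[i]; // widening cast per dimension, e.g. (byte) -3 -> -3.0f
    }
    return out;
  }

  /** Narrow back to bytes for the on-disk posting lists. */
  static byte[] toBytes(float[] v) {
    byte[] out = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (byte) Math.round(v[i]); // rounding choice is an assumption here
    }
    return out;
  }
}
```

Since the reader's strictness matches the final on-disk format, the round trip only happens during flush; search never converts.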
```java
/**
 * Buffers vectors in memory until flush.
 *
 * <p>Future improvements could include off-heap or disk-backed buffering (e.g. ByteBlockPool) to
```
Minor: How do you feel about keeping these ideas in github issues instead of code comments? IMO it's okay to call out a performance bottleneck TODO in comments, but naming specific techniques leaves future readers wondering whether and where they were implemented.
No strong preference here. Will remove them.
```java
    AcceptDocs acceptDocs)
    throws IOException {
  IndexInput dataIn = entry.dataIn.clone();
  for (ScoreDoc centroidDoc : topCentroids.scoreDocs) {
```
Can we avoid looking at all the topK centroid postings, and short-circuit if dist(centroid, target) > nearest_centroid_distance + small_threshold ?
Great catch. I've implemented dynamic pruning in searchFine: it now checks if (bestCentroidScore - currentScore > DYNAMIC_PRUNING_THRESHOLD) and skips the entire partition if it's significantly worse than the best candidate.
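The pruning rule reduces to a one-line predicate. This sketch assumes a "higher score is better" convention and an arbitrary threshold value; the PR's actual constant and calling context may differ:

```java
final class PartitionPruner {
  // Illustrative value; the PR defines its own DYNAMIC_PRUNING_THRESHOLD.
  static final float DYNAMIC_PRUNING_THRESHOLD = 0.1f;

  /**
   * Returns true if a partition whose centroid scored currentScore is far
   * enough behind the best centroid that scanning its postings can be skipped.
   */
  static boolean shouldSkip(float bestCentroidScore, float currentScore) {
    return bestCentroidScore - currentScore > DYNAMIC_PRUNING_THRESHOLD;
  }
}
```

Because centroids arrive ranked from phase one, once one partition is skipped all later (worse-scoring) partitions can be skipped too, which is what makes this a short-circuit rather than just a filter.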
```java
// Cap the number of partitions at configured limit (default 100)
int numPartitions = Math.min(vectorArray.length, maxPartitions);

// Downsample to keep flush time constant
```
Minor: Let's treat this downsampling as a "clustering impl" detail and move it into the clustering class?
Refactored the reservoir sampling logic out of the Writer and into SpannKMeans#downsample.
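For reference, a textbook single-pass reservoir sample of the kind `SpannKMeans#downsample` could use. Names and the seeding strategy are illustrative, not the PR's:

```java
import java.util.Random;

final class Downsampler {
  /**
   * Uniformly samples up to sampleSize vectors from the input in one pass.
   * Each input vector ends up in the reservoir with equal probability.
   */
  static float[][] downsample(float[][] vectors, int sampleSize, long seed) {
    if (vectors.length <= sampleSize) {
      return vectors; // nothing to downsample
    }
    Random random = new Random(seed);
    float[][] reservoir = new float[sampleSize][];
    // Fill the reservoir with the first sampleSize vectors.
    for (int i = 0; i < sampleSize; i++) {
      reservoir[i] = vectors[i];
    }
    // Each later vector replaces a random slot with probability sampleSize / (i + 1).
    for (int i = sampleSize; i < vectors.length; i++) {
      int j = random.nextInt(i + 1);
      if (j < sampleSize) {
        reservoir[j] = vectors[i];
      }
    }
    return reservoir;
  }
}
```

This is what keeps K-Means training time roughly constant per flush: the clustering cost depends on the fixed sample size, not on how many vectors were buffered.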
```java
float bestScore = Float.NEGATIVE_INFINITY;

for (int c = 0; c < centroids.length; c++) {
  float score = simFunc.compare(vector, centroids[c]);
```
k-means clustering would have already assigned some vectors to partitions. Do we need to compare every vector against every centroid?
Since we only train on a sample of the flush (e.g. 16k vectors), treating the final assignment as a fresh pass over all vectors is simpler and safer than tracking partial pre-assignments. I prefer this simplicity for the initial version but will defer to your advice.
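The "fresh pass" is just an exhaustive nearest-centroid assignment. A sketch of its shape, using dot product as a stand-in similarity (the PR uses the field's configured VectorSimilarityFunction), with all names hypothetical:

```java
final class Assigner {
  /** Assigns each vector to its best-scoring centroid in one O(n * k) pass. */
  static int[] assign(float[][] vectors, float[][] centroids) {
    int[] assignments = new int[vectors.length];
    for (int v = 0; v < vectors.length; v++) {
      int best = -1;
      float bestScore = Float.NEGATIVE_INFINITY;
      for (int c = 0; c < centroids.length; c++) {
        float score = dot(vectors[v], centroids[c]);
        if (score > bestScore) {
          bestScore = score;
          best = c;
        }
      }
      assignments[v] = best;
    }
    return assignments;
  }

  // Stand-in for simFunc.compare(...) in the real code.
  static float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

Reusing training-time assignments would only cover the sampled vectors anyway, so the full pass is unavoidable for the unsampled majority.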
- Skip processing candidate partitions that fall below a score threshold relative to the best candidate, improving search latency.
- Enforce consistency between query and index vector types; removes mixed-mode support to simplify search paths.
- Benchmarks: verified recall and memory usage (hot vs. cold) via JMH. Benchmark output is made verbose to distinguish RSS-driving structures from streaming data.
- Centralize vector downsampling in SpannKMeans.
- Add parity tests checking ANN correctness against brute-force search.

Signed-off-by: Atri Sharma <[email protected]>
+1 -- luceneutil now supports Cohere v3 1024-dimension vectors, ~40 MM vectors, unit-sphere normalized.
Add tests and JMH open-reader benchmark. Signed-off-by: Atri Sharma <[email protected]>
Adds Lucene99SpannVectorsFormat, implementing Disk-Resident HNSW-IVF (SPANN) to support vector indices larger than available heap.
This PR segregates the index into a coarse quantizer (centroids in HNSW) and the actual data (disk-resident inverted lists).
Writer flow: Buffers vectors in heap, then runs K-Means++ on flush (using reservoir sampling for amortised performance). Writes centroids to the delegate format and vector data sequentially to .spad files.
Reader flow: Performs a two-phase search. First phase queries HNSW for the nearest nprobe partitions. Second phase uses the chosen centroids from first phase and scans the candidate partitions on disk.
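The two-phase reader flow above can be sketched end to end in miniature. Here the HNSW coarse quantizer is replaced by a brute-force centroid scan and the disk-resident posting lists by in-memory arrays, so every name and type is a stand-in for the real `Lucene99SpannVectorsFormat` machinery:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TwoPhaseSearch {
  /**
   * Phase 1: find the nprobe nearest centroids (stand-in for the HNSW query).
   * Phase 2: scan only those centroids' partitions and keep the k closest docs.
   */
  static int[] search(
      float[] target, float[][] centroids, int[][] partitions,
      float[][] vectors, int nprobe, int k) {
    // Phase 1: rank centroids by distance to the target.
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    java.util.Arrays.sort(order, Comparator.comparingDouble(c -> dist(target, centroids[c])));

    // Phase 2: gather candidates from the nprobe best partitions, then rank them.
    List<Integer> candidates = new ArrayList<>();
    for (int p = 0; p < Math.min(nprobe, order.length); p++) {
      for (int doc : partitions[order[p]]) {
        candidates.add(doc);
      }
    }
    candidates.sort(Comparator.comparingDouble(doc -> dist(target, vectors[doc])));

    int[] result = new int[Math.min(k, candidates.size())];
    for (int i = 0; i < result.length; i++) {
      result[i] = candidates.get(i);
    }
    return result;
  }

  static double dist(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

The recall/nprobe trade-off discussed in the review falls directly out of phase 1: a larger nprobe scans more partitions, recovering vectors whose nearest centroid was not the target's nearest centroid.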
Testing strategy includes unit tests for integrity, clustering correctness, and a recall validation test confirming that a higher nprobe retrieves better results.
Note: Merging currently uses the default implementation and requires heap proportional to segment size. Disk-based merging is a follow-up.