Bulk score hnsw diversity check #15607
base: main
Conversation
float neighborSimilarity = scorer.score(neighbors.nodes()[i]);
if (neighborSimilarity >= score) {
  return false;
bulkScoreNodes[bulkCount++] = neighbors.nodes()[i];
Thinking about this more, always scoring in chunks of 8, even if there are only 8 connections, seems foolish. I will benchmark with Math.min((neighbors.nodes().length + 1) / 2, bulkScoreNodes.length).
kaivalnp
left a comment
Looks like a nice optimization!
private boolean diversityCheck(float score, NeighborArray neighbors, RandomVectorScorer scorer)
    throws IOException {
  int bulkCount = 0;
  final int bulkScoreChunk = Math.min((neighbors.nodes().length + 1) / 2, bulkScoreNodes.length);
IIUC neighbors.nodes() returns the internal array used to store neighbor nodes, which can be larger than the actual number of neighbors and grows exponentially -- can we hit edge cases where its length is about twice the actual number of neighbors, causing us to bulk-score everything?
I wonder if we should use neighbors.size() instead?
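For illustration only, here is a minimal self-contained sketch (not Lucene's actual NeighborArray) of the distinction being raised: a growable, array-backed structure whose backing array can be roughly twice the logical size after a growth step, so nodes().length can overstate the real neighbor count.
import java.util.Arrays;

// Sketch of a growable neighbor list; the names here are hypothetical.
class GrowableNodes {
  private int[] nodes = new int[4]; // backing array, grown exponentially
  private int size = 0;             // number of valid entries

  void add(int node) {
    if (size == nodes.length) {
      nodes = Arrays.copyOf(nodes, nodes.length * 2); // e.g. 4 -> 8
    }
    nodes[size++] = node;
  }

  int[] nodes() { return nodes; } // length may exceed size()
  int size() { return size; }
}

class GrowableNodesDemo {
  public static void main(String[] args) {
    GrowableNodes n = new GrowableNodes();
    for (int i = 0; i < 5; i++) n.add(i);
    System.out.println(n.nodes().length); // 8 (backing array length)
    System.out.println(n.size());         // 5 (actual neighbor count)
  }
}
Iterating up to nodes().length here would visit three stale slots, which is why sizing the chunk off neighbors.size() is the safer choice.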
Good call! And good call on the array copy, just needed to handle the tail. New benchmarks show even nicer perf ;)
private final int[] bulkScoreNodes; // for bulk scoring
private final float[] bulkScores; // for bulk scoring
I was worried about thread-safety of these arrays (given we have concurrent merging), but from this comment it looks like instances of this class are not shared across threads, but rather multiple instances of this class (across different threads) can operate on a single HnswGraph?
@kaivalnp the scorer itself isn't threadsafe. I assumed that since we were using a scorer, we were OK.
I had the same threading concerns and looked it up; it seems that each thread has a unique builder object instance (the worker of the thread) and they all work on the same graph.
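A rough sketch of the threading model being described, using hypothetical names (SharedGraph, Worker) rather than Lucene's actual classes: scratch state like bulkScoreNodes lives in a per-thread builder instance while all threads write into one shared graph, so the arrays themselves never need to be thread-safe.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the graph structure shared by all merge threads.
class SharedGraph {}

// One instance per thread: owns its own scratch arrays and scorer state.
class Worker implements Runnable {
  private final SharedGraph graph;                  // shared across threads
  private final int[] bulkScoreNodes = new int[8];  // per-thread scratch
  private final float[] bulkScores = new float[8];  // per-thread scratch

  Worker(SharedGraph graph) {
    this.graph = graph;
  }

  @Override
  public void run() {
    // insert nodes into the shared graph, touching only this worker's scratch state
  }
}

class MergeSketch {
  public static void main(String[] args) {
    SharedGraph graph = new SharedGraph();
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int t = 0; t < 4; t++) {
      pool.submit(new Worker(graph)); // each thread gets its own Worker instance
    }
    pool.shutdown();
  }
}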
    throws IOException {
  int bulkCount = 0;
  final int bulkScoreChunk = Math.min((neighbors.nodes().length + 1) / 2, bulkScoreNodes.length);
  for (int i = 0; i < neighbors.size(); i++) {
Looks like we don't need to perform any per-node operation in this loop; can we reduce some iterations using something like:
private boolean diversityCheck(float score, NeighborArray neighbors, RandomVectorScorer scorer)
    throws IOException {
  final int chunk = Math.min(Math.ceilDiv(neighbors.size(), 2), bulkScoreNodes.length);
  for (int start = 0; start < neighbors.size(); start += chunk) {
    int length = Math.min(neighbors.size() - start, chunk);
    System.arraycopy(neighbors.nodes(), start, bulkScoreNodes, 0, length);
    if (scorer.bulkScore(bulkScoreNodes, bulkScores, length) >= score) {
      return false;
    }
  }
  return true;
}
kaivalnp
left a comment
Thanks @benwtrent! Curious if you were able to run knnPerfTest?
// handle a tail
if (scored < neighbors.size()) {
  int chunkSize = neighbors.size() - scored;
  System.arraycopy(neighbors.nodes(), scored, bulkScoreNodes, 0, chunkSize);
  if (scorer.bulkScore(bulkScoreNodes, bulkScores, chunkSize) >= score) {
    return false;
  }
I don't think we need this tail -- we're doing Math.min(bulkScoreChunk, neighbors.size() - scored) in the above loop, which automatically bulk-scores the tail (using the second value).
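A tiny standalone example of the arithmetic behind that observation (the sizes here are made up): clamping the per-iteration length with Math.min makes the final partial chunk fall out of the same loop.
class ChunkDemo {
  public static void main(String[] args) {
    int size = 11;  // hypothetical neighbors.size()
    int chunk = 8;  // hypothetical bulkScoreChunk
    for (int start = 0; start < size; start += chunk) {
      int length = Math.min(size - start, chunk);
      // prints [0, 8) then [8, 11): the 3-element tail is bulk-scored by the
      // same loop, so no separate tail branch is needed
      System.out.println("bulkScore over [" + start + ", " + (start + length) + ")");
    }
  }
}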
Yep, good call 🤦 it's reflex.
this.bulkScoreNodes = new int[8];
this.bulkScores = new float[8];
nit: maybe hoist 8 up as a private static final variable?
@kaivalnp I did run with 768 dim vectors (see PR description). I ran it again and got force-merge times ranging from around 77.04s to 90s. Force-merge performance is pretty variable, but across all my runs, this provides a measurable speed improvement.
kaivalnp
left a comment
LGTM!
This adds bulk scoring to the diversity check. While this means the diversity check cannot exit super early (e.g. when it only needs to check 2 docs), I continually see the diversity check as the most expensive part of HNSW graph merging.
This tells me that typically it isn't just one doc that is checked.
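For readers skimming the thread, a minimal before/after sketch of the idea, using a hypothetical SketchScorer interface rather than Lucene's actual code; it assumes bulkScore returns the maximum similarity over the chunk, consistent with the >= comparison in the diff above.
import java.io.IOException;

// Hypothetical scorer interface for illustration only.
interface SketchScorer {
  float score(int node) throws IOException;
  // Scores `count` nodes at once; assumed to return the maximum similarity.
  float bulkScore(int[] nodes, float[] scores, int count) throws IOException;
}

class DiversitySketch {
  // Before: score each neighbor individually; can exit on the very first violation.
  static boolean scalarCheck(float score, int[] neighbors, int size, SketchScorer scorer)
      throws IOException {
    for (int i = 0; i < size; i++) {
      if (scorer.score(neighbors[i]) >= score) {
        return false;
      }
    }
    return true;
  }

  // After: score neighbors a chunk at a time, trading the earliest possible exit
  // for fewer, wider scoring calls.
  static boolean bulkCheck(float score, int[] neighbors, int size, SketchScorer scorer,
      int[] bulkScoreNodes, float[] bulkScores) throws IOException {
    final int chunk = Math.min((size + 1) / 2, bulkScoreNodes.length);
    for (int start = 0; start < size; start += chunk) {
      int length = Math.min(size - start, chunk);
      System.arraycopy(neighbors, start, bulkScoreNodes, 0, length);
      if (scorer.bulkScore(bulkScoreNodes, bulkScores, length) >= score) {
        return false;
      }
    }
    return true;
  }
}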
I ran 1M 768-dim Cohere vectors, force-merging with 4 threads.
baseline: 128.01
candidate: 92.15