feat: implemented tokenizer plugins in rust #5583

base: main

Conversation
```rust
fn create_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a> {
    // Note: This is not the most efficient approach for repeated tokenization,
    // but it ensures thread safety and simplifies lifetime management.
    // For production use, consider caching the factory/tokenizer.
    let stream = PluginTokenStreamAdapter::new(Arc::clone(&self.library), &self.config, text);
    BoxTokenStream::new(stream)
}
```
Creating the factory is expensive because it loads a tokenizer plugin (~*MB).
Should we cache the factory only, or cache both the factory and the tokenizer?
Probably just cache the factory.
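As a sketch of that caching idea, the expensive factory creation could be deferred behind a `std::sync::OnceLock` so the plugin load happens at most once per tokenizer. The types below (`PluginLibrary`, `PluginFactory`, `PluginTokenizer`) are hypothetical stand-ins for illustration, not the PR's actual types.

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-ins for the plugin types discussed in this thread.
struct PluginLibrary;
struct PluginFactory;

impl PluginLibrary {
    // Expensive: loads the tokenizer plugin and builds a factory.
    fn create_factory(&self) -> PluginFactory {
        PluginFactory
    }
}

struct PluginTokenizer {
    library: Arc<PluginLibrary>,
    factory: OnceLock<PluginFactory>,
}

impl PluginTokenizer {
    fn factory(&self) -> &PluginFactory {
        // Created at most once; later calls reuse the cached factory,
        // so repeated create_stream calls avoid reloading the plugin.
        self.factory.get_or_init(|| self.library.create_factory())
    }
}
```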
```rust
// Plugin tokenizer is handled separately as it returns LanceTokenizer directly
if self.base_tokenizer == "plugin" {
    return self.build_plugin_tokenizer();
}
```
Since build_plugin_tokenizer() returns a LanceTokenizer, I added an early return instead of adding a new match condition in build_base_tokenizer(), which returns a TextAnalyzerBuilder.
Should I unify this branching logic?
I think so. One of the goals should be to be able to replace the Lindera and Jieba feature flags with plugins, without changing the output. So it seems like you'll want to place the base tokenizer inside of build_base_tokenizer.
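The unified dispatch suggested here could look like the following hypothetical sketch, where "plugin" becomes one more arm of the base-tokenizer match rather than an early return, leaving room to later map the Lindera and Jieba feature flags onto plugins without changing output. The enum and function signature are illustrative assumptions, not the PR's actual types.

```rust
// Illustrative only: the real build_base_tokenizer returns a
// TextAnalyzerBuilder; this sketch just shows the unified match shape.
enum BaseTokenizer {
    Simple,
    Whitespace,
    Plugin { library: String },
}

fn build_base_tokenizer(name: &str) -> Result<BaseTokenizer, String> {
    match name {
        "simple" => Ok(BaseTokenizer::Simple),
        "whitespace" => Ok(BaseTokenizer::Whitespace),
        // The plugin case is just another arm, not a special early return.
        "plugin" => Ok(BaseTokenizer::Plugin { library: "libexample".into() }),
        other => Err(format!("unknown base tokenizer: {other}")),
    }
}
```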
@wjones127 I have opened a draft PR #5583 mentioned in lancedb/lancedb#2855 (comment).
wjones127 left a comment:
Thanks for working on this. This is close to what I had in mind. I have various minor suggestions.
include/lance_tokenizer_plugin.h (Outdated)
```c
uint32_t position;
uint32_t position_length;

/// Pointer to the token text (null-terminated UTF-8)
```
Do we want null terminated?
Not anymore. As mentioned later, CStringRef allows zero-copy with Rust's &str.
This is going to mostly be used with Rust implementations, so using null-terminated strings doesn't seem like the right choice to me. Rust will have to do a memcpy to convert &str / String into null terminated strings.
I think it might be worth just defining a struct for a string reference like:
```c
typedef struct StringReference {
    size_t start;
    size_t length;
} StringReference;
```

Then you can make a trivial conversion between that and Rust's &str.
I've implemented your definition (named CStringRef) for zero-copy string passing between Rust and the plugin.
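For illustration, a zero-copy bridge between such a struct and Rust's `&str` might look like the sketch below. The field names (`data`, `len`, pointer-plus-length rather than offset-plus-length) are assumptions, since the PR's exact `CStringRef` layout isn't shown here.

```rust
// Illustrative layout: a borrowed, non-null-terminated UTF-8 slice.
#[repr(C)]
struct CStringRef {
    data: *const u8,
    len: usize,
}

impl CStringRef {
    // Zero-copy: borrows the &str's bytes directly; no memcpy and
    // no null terminator needed, unlike C strings.
    fn from_str(s: &str) -> CStringRef {
        CStringRef { data: s.as_ptr(), len: s.len() }
    }

    // SAFETY: caller must guarantee `data` points to `len` valid UTF-8
    // bytes that outlive the returned reference.
    unsafe fn as_str(&self) -> &str {
        unsafe {
            std::str::from_utf8_unchecked(std::slice::from_raw_parts(self.data, self.len))
        }
    }
}
```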
include/lance_tokenizer_plugin.h (Outdated)
```c
/// Create a tokenizer factory with the given JSON configuration.
///
/// @param config_json JSON configuration string (UTF-8, null-terminated)
/// @param config_len Length of config_json in bytes (not including null terminator)
/// @return Factory handle, or NULL on error (call get_error for details)
LanceTokenizerFactory* (*create_factory)(const char* config_json, uint32_t config_len);
```
- If you are going to parse config here, we should have some facility to pass back an error.
- I'd honestly prefer we didn't use JSON here, and instead just had the caller make multiple calls to set_config(const StringReference key, const StringReference value). That way the plugin doesn't require a JSON parsing library.
Maybe the API should be like:
```c
typedef struct Error {
    StringReference message;
} Error;

typedef struct LanceTokenizerPlugin {
    void (*create_factory)(LanceTokenizerFactory* out);
    void (*set_config)(const StringReference key, const StringReference value, Error* err);
    ...
} LanceTokenizerPlugin;
```
In the initial implementation, I assumed the configuration format would be JSON.
However, I later realized that different plugins may require different formats; for example, jieba uses JSON, while lindera uses YAML.
Therefore, I decided that the configuration format should be defined by each plugin, and I left the configuration-file loading behavior unchanged.
I also felt it was natural for clients to take responsibility if an error occurs while loading the tokenizer configuration.
What do you think about this approach?
Good point about Jieba and Lindera. If they both already take string options and each in a different format, then just having a string configuration input makes sense. 👍
include/lance_tokenizer_plugin.h (Outdated)
```c
/// Get the last error message.
///
/// @param factory Factory handle (can be NULL to get global/loading errors)
/// @return Error message (null-terminated), or NULL if no error
///         The returned string is valid until the next error-generating call
const char* (*get_error)(LanceTokenizerFactory* factory);
```
I think we should return errors as an out parameter. It seems like the get_error style doesn't work well with multiple threads, and we definitely want to run the tokenizers in multiple threads. They are the most expensive part of creating an FTS index.
Thanks, I added an Error* error argument.
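The thread-safety argument for the out-parameter style can be sketched in Rust as follows. The `Error` layout and function names here are illustrative assumptions, not the PR's exact definitions; the point is that each call writes its error into caller-owned storage, so concurrent tokenizer threads never share mutable error state the way a get_error-style API would.

```rust
// Illustrative error out-parameter: a nullable message pointer plus length.
#[repr(C)]
struct Error {
    message: *const u8,
    message_len: usize,
}

impl Error {
    fn none() -> Error {
        Error { message: std::ptr::null(), message_len: 0 }
    }
    fn is_set(&self) -> bool {
        !self.message.is_null()
    }
}

// On failure, the call fills `err` (caller-owned storage) instead of
// stashing the error in shared plugin state, so parallel tokenizer
// threads cannot clobber each other's errors.
fn tokenize_checked(text: &str, err: &mut Error) -> Option<Vec<String>> {
    if text.is_empty() {
        static MSG: &str = "empty input";
        *err = Error { message: MSG.as_ptr(), message_len: MSG.len() };
        return None;
    }
    Some(text.split_whitespace().map(String::from).collect())
}
```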
```proto
optional string tokenizer_plugin_library = 12;
optional string tokenizer_plugin_config = 13;
```
This is sort of a change to the format, so we should probably open a design discussion on this. I can do that.
Could you add some comments to this part of the protobuf file? It should be clear to an implementor of a Lance library how these fields should be handled. How do we resolve the name of the dynamic library? What should the implementation do if the library is not found?
Started a discussion here: #5607
Thanks for opening a discussion.
I added some comments in the proto file.
I found your comment in the discussion:

> If the plugin dynamic library is not found, the index should be ignored for queries and not updated on writes. Users should get an error message when attempting to update the index that they need to install the correct plugin.

As commented in the proto, the current implementation throws an error if the dynamic library is not found.
However, this is a provisional implementation and will be revised according to the results of the discussion.
```proto
optional string tokenizer_plugin_library = 12;
optional string tokenizer_plugin_config = 13;
```
Maybe make config a map? Or do we think we'll have nested structures in there?
```diff
 optional string tokenizer_plugin_library = 12;
-optional string tokenizer_plugin_config = 13;
+optional map<string, string> tokenizer_plugin_config = 13;
```
As mentioned in #5583 (comment), I kept tokenizer_plugin_config as a string.
related to #3222
Key changes are as follows:
- include/lance_tokenizer_plugin.h: add C API for the tokenizer plugin
- protos/index_old.proto: add two fields to restore the plugin tokenizer configuration
- rust/lance-index/src/scalar/inverted/tokenizer.rs and rust/lance-index/src/scalar/inverted/plugin/*: implement tokenizer loading
- rust/lance-index/examples/: add an example usage

During the PR creation process, I had two questions and left comments in the PR.