Conversation

@KazuhitoT
Contributor

@KazuhitoT KazuhitoT commented Dec 26, 2025

related to #3222

Key changes are as follows:

  • include/lance_tokenizer_plugin.h: add C API for the tokenizer plugin
  • protos/index_old.proto: add two fields to restore the plugin tokenizer configuration
  • rust/lance-index/src/scalar/inverted/tokenizer.rs and rust/lance-index/src/scalar/inverted/plugin/*: implement tokenizer loading
  • rust/lance-index/examples/: add an example usage

While creating this PR, I had two questions and left inline comments below.

@github-actions github-actions bot added the enhancement New feature or request label Dec 26, 2025
Comment on lines 48 to 55
fn create_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a> {
// Note: This is not the most efficient approach for repeated tokenization,
// but it ensures thread safety and simplifies lifetime management.
// For production use, consider caching the factory/tokenizer.
let stream = PluginTokenStreamAdapter::new(Arc::clone(&self.library), &self.config, text);
BoxTokenStream::new(stream)
}
}
Contributor Author

Creating the factory is expensive because it loads a tokenizer plugin (~*MB).
Should we cache the factory only, or cache both the factory and the tokenizer?

Contributor

Probably just cache the factory.
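The caching suggestion could look roughly like the following sketch. The `PluginFactory`/`cached_factory` names are hypothetical stand-ins, not the PR's actual types; in the real code the expensive step is loading the plugin's dynamic library.

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-in for the plugin factory; in the PR the real type
// wraps a dynamically loaded tokenizer library (the ~MB-sized load).
struct PluginFactory;

impl PluginFactory {
    // Expensive in the real implementation: loads the plugin library.
    fn load() -> Arc<PluginFactory> {
        Arc::new(PluginFactory)
    }

    // Trivial tokenizer stand-in so the sketch is runnable.
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split_whitespace().collect()
    }
}

// Cache only the factory, as suggested: load once, reuse for every stream.
fn cached_factory() -> Arc<PluginFactory> {
    static FACTORY: OnceLock<Arc<PluginFactory>> = OnceLock::new();
    Arc::clone(FACTORY.get_or_init(PluginFactory::load))
}

fn main() {
    let tokens = cached_factory().tokenize("hello lance tokenizer");
    assert_eq!(tokens, vec!["hello", "lance", "tokenizer"]);
    println!("{:?}", tokens);
}
```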

Comment on lines 313 to 317
// Plugin tokenizer is handled separately as it returns LanceTokenizer directly
if self.base_tokenizer == "plugin" {
return self.build_plugin_tokenizer();
}

Contributor Author

Since build_plugin_tokenizer() returns a LanceTokenizer, I added an early return instead of adding a new match condition in build_base_tokenizer(), which returns a TextAnalyzerBuilder.
Should I unify this branching logic?

Contributor

I think so. One of the goals should be to be able to replace the Lindera and Jieba feature flags with plugins, without changing the output. So it seems like you'll want to place the base tokenizer inside of build_base_tokenizer.
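One possible shape for that unification, sketched with placeholder types: the real code would return tantivy's `TextAnalyzerBuilder` or a boxed `LanceTokenizer` where this sketch uses strings, and the tokenizer names here are illustrative.

```rust
// Sketch: let build_base_tokenizer return an enum so the "plugin" case
// lives in the same match as the built-in tokenizers, instead of being
// an early return before the match.
enum BaseTokenizer {
    Builder(String), // stand-in for tantivy's TextAnalyzerBuilder
    Plugin(String),  // stand-in for a boxed LanceTokenizer
}

fn build_base_tokenizer(name: &str) -> Result<BaseTokenizer, String> {
    match name {
        "simple" | "whitespace" | "raw" => Ok(BaseTokenizer::Builder(name.to_string())),
        "plugin" => Ok(BaseTokenizer::Plugin("loaded-from-library".to_string())),
        other => Err(format!("unknown base tokenizer: {other}")),
    }
}

fn main() {
    assert!(matches!(build_base_tokenizer("plugin"), Ok(BaseTokenizer::Plugin(_))));
    assert!(matches!(build_base_tokenizer("simple"), Ok(BaseTokenizer::Builder(_))));
    assert!(build_base_tokenizer("bogus").is_err());
    println!("ok");
}
```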

@codecov

codecov bot commented Dec 26, 2025

@KazuhitoT
Contributor Author

@wjones127 I have opened a draft PR #5583 mentioned in lancedb/lancedb#2855 (comment).
Could you take a look whenever you have time?

@wjones127 wjones127 self-assigned this Dec 29, 2025
Contributor

@wjones127 wjones127 left a comment

Thanks for working on this. This is close to what I had in mind. I have various minor suggestions.

uint32_t position;
uint32_t position_length;

/// Pointer to the token text (null-terminated UTF-8)
Contributor

Do we want null terminated?

Contributor Author

Not anymore. As mentioned later, CStringRef allows zero-copy with Rust's &str.

Contributor

This is going to mostly be used with Rust implementations, so using null-terminated strings doesn't seem like the right choice to me. Rust will have to do a memcpy to convert &str / String into null terminated strings.

I think it might be worth just defining a struct for a string reference like:

typedef struct StringReference {
    size_t start;
    size_t length;
} StringReference;

Then you can make a trivial conversion between that and Rust's &str.
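For illustration, a pointer-plus-length variant of that idea on the Rust side. The name `CStringRef` matches the author's later comment, but the fields (`ptr`, `len`) and methods here are assumptions, not the PR's actual definition.

```rust
use std::slice;
use std::str;

/// Hypothetical zero-copy string reference, in the spirit of the suggested
/// StringReference but carrying a pointer instead of an offset.
#[repr(C)]
pub struct CStringRef {
    pub ptr: *const u8,
    pub len: usize,
}

impl CStringRef {
    /// Borrow a Rust &str as a CStringRef (no allocation, no memcpy).
    pub fn from_str(s: &str) -> CStringRef {
        CStringRef { ptr: s.as_ptr(), len: s.len() }
    }

    /// Reconstruct the &str on the other side of the FFI boundary.
    /// Safety: ptr/len must describe valid UTF-8 that outlives the borrow.
    pub unsafe fn as_str<'a>(&self) -> &'a str {
        unsafe { str::from_utf8_unchecked(slice::from_raw_parts(self.ptr, self.len)) }
    }
}

fn main() {
    let text = "こんにちは lance";
    let r = CStringRef::from_str(text);
    let back = unsafe { r.as_str() };
    assert_eq!(back, text);
    println!("{}", back);
}
```

Because the struct is `#[repr(C)]` and no null terminator is required, both directions of the conversion are O(1).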

Contributor Author

I've implemented your definition (named CStringRef) for zero-copy string passing between Rust and the plugin.

Comment on lines 39 to 44
/// Create a tokenizer factory with the given JSON configuration.
///
/// @param config_json JSON configuration string (UTF-8, null-terminated)
/// @param config_len Length of config_json in bytes (not including null terminator)
/// @return Factory handle, or NULL on error (call get_error for details)
LanceTokenizerFactory* (*create_factory)(const char* config_json, uint32_t config_len);
Contributor

  1. If you are going to parse config here, we should have some facility to pass back an error.
  2. I'd honestly prefer we didn't use JSON here, and instead just had the caller make multiple calls to set_config(const StringReference key, const StringReference value). That way the plugin doesn't require a JSON parsing library.

Maybe the API should be like:

typedef struct Error {
    StringReference message;
} Error;

typedef struct LanceTokenizerPlugin {
    void (*create_factory)(LanceTokenizerFactory* out);
    void (*set_config)(const StringReference key, const StringReference value, Error* err);
    ...
} LanceTokenizerPlugin;

Contributor Author

In the initial implementation, I assumed the configuration format would be JSON.
However, I later realized that different plugins may require different formats: for example, jieba uses JSON, while lindera uses YAML.
Therefore, I decided that the configuration format should be defined by each plugin, and I left the configuration file loading behavior unchanged.
I also felt it was natural for clients to take responsibility if an error occurs while loading the tokenizer configuration.
What do you think about this approach?

Contributor

Good point about Jieba and Lindera. If they both already take string options and each in a different format, then just having a string configuration input makes sense. 👍

Comment on lines 84 to 89
/// Get the last error message.
///
/// @param factory Factory handle (can be NULL to get global/loading errors)
/// @return Error message (null-terminated), or NULL if no error
/// The returned string is valid until the next error-generating call
const char* (*get_error)(LanceTokenizerFactory* factory);
Contributor

I think we should return errors as an out parameter. It seems like the get_error style doesn't work well with multiple threads, and we definitely want to run the tokenizers in multiple threads. They are the most expensive part of creating an FTS index.

Contributor Author

Thanks, I added Error* error arg.
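For context, the out-parameter style avoids the shared mutable state that makes `get_error` racy. A simplified, pure-Rust sketch (the `Error` and `create_tokenizer` names are illustrative, not the PR's actual API) showing why per-call error slots are safe across threads:

```rust
use std::thread;

// Each caller owns its own Error slot, so concurrent tokenizer threads
// never contend on a single "last error" buffer the way get_error() would.
#[derive(Default, Debug)]
struct Error {
    code: i32,
    message: String,
}

// Stand-in for a plugin call that reports failure through an out parameter.
fn create_tokenizer(name: &str, err: &mut Error) -> Option<String> {
    if name.is_empty() {
        err.code = 1;
        err.message = "tokenizer name must not be empty".to_string();
        None
    } else {
        Some(format!("tokenizer:{name}"))
    }
}

fn main() {
    // Each thread carries its own Error, so failures are reported per call.
    let handles: Vec<_> = ["jieba", "", "lindera"]
        .into_iter()
        .map(|name| {
            thread::spawn(move || {
                let mut err = Error::default();
                let tok = create_tokenizer(name, &mut err);
                (name, tok, err)
            })
        })
        .collect();

    for h in handles {
        let (name, tok, err) = h.join().unwrap();
        match tok {
            Some(t) => println!("{name}: ok ({t})"),
            None => println!("{name}: error {} ({})", err.code, err.message),
        }
    }
}
```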

Comment on lines 42 to 43
optional string tokenizer_plugin_library = 12;
optional string tokenizer_plugin_config = 13;
Contributor

This is sort of a change to the format, so we should probably open a design discussion on this. I can do that.

Contributor

Could you add some comments to this part of the protobuf file? It should be clear to an implementor of a Lance library how these variables should be handled. How do we resolve the name of the dynamic library? What should the library do if the library is not found?

Contributor

Started a discussion here: #5607

Contributor Author

Thanks for opening a discussion.
I added some comments in proto.

Contributor Author

I found your comment in the discussion:

If the plugin dynamic library is not found, the index should be ignored for queries and not updated on writes. Users should get an error message when attempting to update the index that they need to install the correct plugin.

As commented in the proto file, the current implementation throws an error if the dynamic library is not found.
However, this is a provisional implementation and will be revised according to the outcome of the discussion.

Comment on lines 42 to 43
optional string tokenizer_plugin_library = 12;
optional string tokenizer_plugin_config = 13;
Contributor

Maybe make config a map? Or do we think we'll have nested structures in there?

Suggested change
  optional string tokenizer_plugin_library = 12;
- optional string tokenizer_plugin_config = 13;
+ optional map<string, string> tokenizer_plugin_config = 13;

Contributor Author

As mentioned in #5583 (comment), I kept tokenizer_plugin_config as a string.
