feat: implemented tokenizer plugins in rust #5583

base: main

Conversation
```rust
fn create_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a> {
    // Note: This is not the most efficient approach for repeated tokenization,
    // but it ensures thread safety and simplifies lifetime management.
    // For production use, consider caching the factory/tokenizer.
    let stream = PluginTokenStreamAdapter::new(Arc::clone(&self.library), &self.config, text);
    BoxTokenStream::new(stream)
}
```
Creating the factory is expensive because it loads a tokenizer plugin (~*MB).
Should we cache the factory only, or cache both the factory and the tokenizer?
Probably just cache the factory.
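As a sketch of that caching idea, the expensive factory creation could be deferred behind a `std::sync::OnceLock` so the plugin load happens at most once per tokenizer. The types below (`PluginLibrary`, `PluginFactory`, `PluginTokenizer`) are hypothetical stand-ins for illustration, not the PR's actual types.

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-ins for the plugin types discussed in this thread.
struct PluginLibrary;
struct PluginFactory;

impl PluginLibrary {
    // Expensive: loads the tokenizer plugin and builds a factory.
    fn create_factory(&self) -> PluginFactory {
        PluginFactory
    }
}

struct PluginTokenizer {
    library: Arc<PluginLibrary>,
    factory: OnceLock<PluginFactory>,
}

impl PluginTokenizer {
    fn factory(&self) -> &PluginFactory {
        // Created at most once; later calls reuse the cached factory,
        // so repeated create_stream calls avoid reloading the plugin.
        self.factory.get_or_init(|| self.library.create_factory())
    }
}
```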
```rust
// Plugin tokenizer is handled separately as it returns LanceTokenizer directly
if self.base_tokenizer == "plugin" {
    return self.build_plugin_tokenizer();
}
```
Since build_plugin_tokenizer() returns a LanceTokenizer, I added an early return instead of adding a new match condition in build_base_tokenizer(), which returns a TextAnalyzerBuilder.
Should I unify this branching logic?
I think so. One of the goals should be to be able to replace the Lindera and Jieba feature flags with plugins, without changing the output. So it seems like you'll want to place the base tokenizer inside of build_base_tokenizer.
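The unified dispatch suggested here could look like the following hypothetical sketch, where "plugin" becomes one more arm of the base-tokenizer match rather than an early return, leaving room to later map the Lindera and Jieba feature flags onto plugins without changing output. The enum and function signature are illustrative assumptions, not the PR's actual types.

```rust
// Illustrative only: the real build_base_tokenizer returns a
// TextAnalyzerBuilder; this sketch just shows the unified match shape.
enum BaseTokenizer {
    Simple,
    Whitespace,
    Plugin { library: String },
}

fn build_base_tokenizer(name: &str) -> Result<BaseTokenizer, String> {
    match name {
        "simple" => Ok(BaseTokenizer::Simple),
        "whitespace" => Ok(BaseTokenizer::Whitespace),
        // The plugin case is just another arm, not a special early return.
        "plugin" => Ok(BaseTokenizer::Plugin { library: "libexample".into() }),
        other => Err(format!("unknown base tokenizer: {other}")),
    }
}
```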
@wjones127 I have opened a draft PR #5583 mentioned in lancedb/lancedb#2855 (comment).
wjones127 left a comment:
Thanks for working on this. This is close to what I had in mind. I have various minor suggestions.
include/lance_tokenizer_plugin.h (Outdated)
```c
uint32_t position;
uint32_t position_length;

/// Pointer to the token text (null-terminated UTF-8)
```
Do we want null terminated?
Not anymore. As mentioned later, CStringRef allows zero-copy with Rust's &str.
This is going to mostly be used with Rust implementations, so using null-terminated strings doesn't seem like the right choice to me. Rust will have to do a memcpy to convert &str / String into null terminated strings.
I think it might be worth just defining a struct for a string reference like:
```c
typedef struct StringReference {
    size_t start;
    size_t length;
} StringReference;
```

Then you can make a trivial conversion between that and Rust's &str.
I've implemented your definition (named CStringRef) for zero-copy string passing between Rust and the plugin.
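For illustration, a zero-copy bridge between such a struct and Rust's `&str` might look like the sketch below. The field names (`data`, `len`, pointer-plus-length rather than offset-plus-length) are assumptions, since the PR's exact `CStringRef` layout isn't shown here.

```rust
// Illustrative layout: a borrowed, non-null-terminated UTF-8 slice.
#[repr(C)]
struct CStringRef {
    data: *const u8,
    len: usize,
}

impl CStringRef {
    // Zero-copy: borrows the &str's bytes directly; no memcpy and
    // no null terminator needed, unlike C strings.
    fn from_str(s: &str) -> CStringRef {
        CStringRef { data: s.as_ptr(), len: s.len() }
    }

    // SAFETY: caller must guarantee `data` points to `len` valid UTF-8
    // bytes that outlive the returned reference.
    unsafe fn as_str(&self) -> &str {
        unsafe {
            std::str::from_utf8_unchecked(std::slice::from_raw_parts(self.data, self.len))
        }
    }
}
```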
include/lance_tokenizer_plugin.h (Outdated)
```c
/// Create a tokenizer factory with the given JSON configuration.
///
/// @param config_json JSON configuration string (UTF-8, null-terminated)
/// @param config_len Length of config_json in bytes (not including null terminator)
/// @return Factory handle, or NULL on error (call get_error for details)
LanceTokenizerFactory* (*create_factory)(const char* config_json, uint32_t config_len);
```
- If you are going to parse config here, we should have some facility to pass back an error.
- I'd honestly prefer we didn't use JSON here, and instead just had the caller make multiple calls to set_config(const StringReference key, const StringReference value). That way the plugin doesn't require a JSON parsing library.
Maybe the API should be like:
```c
typedef struct Error {
    StringReference message;
} Error;

typedef struct LanceTokenizerPlugin {
    void (*create_factory)(LanceTokenizerFactory* out);
    void (*set_config)(const StringReference key, const StringReference value, Error* err);
    ...
} LanceTokenizerPlugin;
```
In the initial implementation, I assumed the configuration format would be JSON.
However, I later realized that different plugins may require different formats; for example, jieba uses JSON, while lindera uses YAML.
Therefore, I decided that the configuration format should be defined by each plugin, and I left the configuration-file loading behavior unchanged.
I also felt it was natural for clients to take responsibility if an error occurs while loading the tokenizer configuration.
What do you think about this approach?
Good point about Jieba and Lindera. If they both already take string options and each in a different format, then just having a string configuration input makes sense. 👍
include/lance_tokenizer_plugin.h (Outdated)
```c
/// Get the last error message.
///
/// @param factory Factory handle (can be NULL to get global/loading errors)
/// @return Error message (null-terminated), or NULL if no error
///         The returned string is valid until the next error-generating call
const char* (*get_error)(LanceTokenizerFactory* factory);
```
I think we should return errors as an out parameter. It seems like the get_error style doesn't work well with multiple threads, and we definitely want to run the tokenizers in multiple threads. They are the most expensive part of creating an FTS index.
Thanks, I added an Error* error argument.
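The thread-safety argument for the out-parameter style can be sketched in Rust as follows. The `Error` layout and function names here are illustrative assumptions, not the PR's exact definitions; the point is that each call writes its error into caller-owned storage, so concurrent tokenizer threads never share mutable error state the way a get_error-style API would.

```rust
// Illustrative error out-parameter: a nullable message pointer plus length.
#[repr(C)]
struct Error {
    message: *const u8,
    message_len: usize,
}

impl Error {
    fn none() -> Error {
        Error { message: std::ptr::null(), message_len: 0 }
    }
    fn is_set(&self) -> bool {
        !self.message.is_null()
    }
}

// On failure, the call fills `err` (caller-owned storage) instead of
// stashing the error in shared plugin state, so parallel tokenizer
// threads cannot clobber each other's errors.
fn tokenize_checked(text: &str, err: &mut Error) -> Option<Vec<String>> {
    if text.is_empty() {
        static MSG: &str = "empty input";
        *err = Error { message: MSG.as_ptr(), message_len: MSG.len() };
        return None;
    }
    Some(text.split_whitespace().map(String::from).collect())
}
```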
```proto
optional string tokenizer_plugin_library = 12;
optional string tokenizer_plugin_config = 13;
```
This is sort of a change to the format, so we should probably open a design discussion on this. I can do that.
Could you add some comments to this part of the protobuf file? It should be clear to an implementor of a Lance library how these fields should be handled. How do we resolve the name of the dynamic library? What should the implementation do if the library is not found?
Started a discussion here: #5607
Thanks for opening a discussion.
I added some comments in the proto file.
I found your comment in the discussion:

> If the plugin dynamic library is not found, the index should be ignored for queries and not updated on writes. Users should get an error message when attempting to update the index that they need to install the correct plugin.

As commented in the proto, the current implementation throws an error if the dynamic library is not found.
However, this is a provisional implementation and will be revised according to the results of the discussion.
```proto
optional string tokenizer_plugin_library = 12;
optional string tokenizer_plugin_config = 13;
```
Maybe make config a map? Or do we think we'll have nested structures in there?
```diff
 optional string tokenizer_plugin_library = 12;
-optional string tokenizer_plugin_config = 13;
+optional map<string, string> tokenizer_plugin_config = 13;
```
As mentioned in #5583 (comment), I kept tokenizer_plugin_config as a string.
related to #3222
Key changes are as follows:
- include/lance_tokenizer_plugin.h: add C API for the tokenizer plugin
- protos/index_old.proto: add two fields to restore the plugin tokenizer configuration
- rust/lance-index/src/scalar/inverted/tokenizer.rs and rust/lance-index/src/scalar/inverted/plugin/*: implement tokenizer loading
- rust/lance-index/examples/: add an example usage

During the PR creation process, I had two questions and left comments in the PR.