FTS Tokenizer Plugins #5607
wjones127 started this conversation in Lance Table Format
Motivation
Users want to use more tokenizers, such as Jieba or Lindera. However, adding them can add 10 MB or more to the binary size [1]. Right now we support them as optional features, which works for Rust users but not for Python or Java users, who get pre-compiled artifacts and thus can't change the feature flags. So instead, we'd like to have a plugin system based on dynamic libraries.
Building and distribution
Because the plugins implement a stable C API, they don't need to be released on the same cadence as the main packages. They can be maintained in a repo under lance-format and released on their own schedule.
We can upload the shared libraries to GitHub Releases on that repo. The repo should provide download instructions for installing the plugin.
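As a rough illustration of what "simple downloads" could mean for a pre-compiled binding, the sketch below looks for a downloaded shared library in a list of directories. The file naming scheme (`lance_tokenizer_<name>`), the function names, and the idea of a search-directory list are all assumptions made for this example; the proposal does not specify them. Actually loading the library and calling into its C API is out of scope here.

```python
import sys
from pathlib import Path
from typing import Iterable, Optional


def plugin_filename(name: str, platform: str = sys.platform) -> str:
    """Return a platform-specific shared library file name for a plugin.

    The "lance_tokenizer_" prefix is a hypothetical naming convention.
    """
    stem = f"lance_tokenizer_{name}"
    if platform.startswith("win"):
        return f"{stem}.dll"
    if platform == "darwin":
        return f"lib{stem}.dylib"
    # Treat everything else as an ELF platform (Linux, BSD, ...).
    return f"lib{stem}.so"


def find_plugin(
    name: str,
    search_dirs: Iterable[str],
    platform: str = sys.platform,
) -> Optional[Path]:
    """Return the first matching plugin library, or None if none is installed."""
    filename = plugin_filename(name, platform)
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```

Returning `None` rather than raising keeps the "plugin not installed" case cheap to check, which matters because a missing plugin is an expected state and not a failure in itself.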
Automatic installation / plugin registry
Down the road, we might want a plugin registry. DuckDB does this for its extensions [2], for example. However, this brings in security concerns, such as the need to sign and validate binaries. So we will stick to simple downloads for now.
Index Behavior
Add two new fields to the protobuf for InvertedIndexDetails.
If an index is found with this info, the library should look for a tokenizer plugin with that name.
If the plugin dynamic library is not found, the index should be ignored for queries and not updated on writes. When users attempt to update the index, they should get an error message telling them to install the correct plugin.
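The fallback rules above can be sketched as follows. This is only an illustration of the described behavior, not the draft implementation: `FtsIndex`, `tokenizer_plugin`, `usable_indices`, `check_can_update`, and the error type are hypothetical names, and the set of installed plugins is passed in explicitly to keep the example self-contained. It also assumes that an index with no plugin name uses a built-in tokenizer.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


class MissingTokenizerPluginError(RuntimeError):
    """Raised when an index needs a tokenizer plugin that is not installed."""


@dataclass(frozen=True)
class FtsIndex:
    name: str
    # Hypothetical field: None means the index uses a built-in tokenizer.
    tokenizer_plugin: Optional[str] = None


def plugin_available(index: FtsIndex, installed: Set[str]) -> bool:
    """True if the index needs no plugin or its plugin is installed."""
    return index.tokenizer_plugin is None or index.tokenizer_plugin in installed


def usable_indices(indices: List[FtsIndex], installed: Set[str]) -> List[FtsIndex]:
    """Indices that may serve queries or be maintained on writes.

    Indices whose plugin is missing are silently skipped.
    """
    return [index for index in indices if plugin_available(index, installed)]


def check_can_update(index: FtsIndex, installed: Set[str]) -> None:
    """Fail with an actionable message on an explicit index update."""
    if not plugin_available(index, installed):
        raise MissingTokenizerPluginError(
            f"Index '{index.name}' requires tokenizer plugin "
            f"'{index.tokenizer_plugin}', which is not installed. "
            "Install the plugin to update this index."
        )
```

The split between a silent filter and an explicit check mirrors the text: queries and ordinary writes keep working without the plugin, while a deliberate attempt to update the index surfaces the missing dependency.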
Implementation
A draft implementation is at #5583
Footnotes
1. https://github.com/lancedb/lancedb/pull/2855#issuecomment-3629653080
2. https://duckdb.org/docs/stable/extensions/overview