Hi there. In the current Lance v2.1 implementation, when `build_variable_width_compressor` is used to compress variable-width data, specifying a general-purpose compression algorithm (e.g., ZSTD) in the field metadata can still result in `fsst` being applied internally because of the existing logic (see the code section below). I'm not sure whether this behavior is intentional. It works technically, but it is conceptually confusing: `none` / `fsst` / `zstd` / `lz4` appear to be parallel choices, yet `fsst` can be nested under a general-purpose compression algorithm. This nesting is not obvious from the API.
Additionally, when an outer general-purpose compressor such as `zstd` is applied, the random-access benefit of `fsst` is lost. The nested `fsst` + `zstd` approach may yield better compression ratios, but it also adds decompression overhead, so there is a trade-off. I'm not sure this behavior is always desirable. Do you think we should add a configuration option to control it?
relevant code
Here is the relevant code. If the data qualifies for `fsst`, step 3 selects the `fsst` encoder, and step 4 may then apply a general-purpose compression encoder as well. In other words, `fsst` can be chosen first, and an additional outer compressor can then be layered on top of it.
https://github.com/lancedb/lance/blob/254a8217ac26666585983aa7ec8c4234f4c3f99f/rust/lance-encoding/src/compression.rs#L378-L405
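To make the question concrete, here is a minimal, self-contained sketch of the layering described above. It is not the Lance code: `Request`, `Encoder`, `GeneralScheme`, `choose_encoder`, and the `allow_fsst_under_general` flag are all hypothetical names invented for illustration, and the flag only models what a configuration option might look like.

```rust
/// Hypothetical stand-in for the general-purpose scheme named in field metadata.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum GeneralScheme {
    Zstd,
    Lz4,
}

/// Hypothetical description of the encoder stack that ends up being built.
#[derive(Debug, Clone, PartialEq)]
enum Encoder {
    /// Plain variable-width encoding, no compression.
    Plain,
    /// `fsst` applied on top of an inner encoder.
    Fsst(Box<Encoder>),
    /// A general-purpose compressor wrapped around an inner encoder.
    General(GeneralScheme, Box<Encoder>),
}

/// Hypothetical inputs to the selection logic.
struct Request {
    /// Whether the data statistics make `fsst` applicable.
    qualifies_for_fsst: bool,
    /// General-purpose compression requested in field metadata, if any.
    general: Option<GeneralScheme>,
    /// Assumed opt-out flag (does not exist in Lance): when false,
    /// `fsst` is skipped whenever a general-purpose compressor is requested.
    allow_fsst_under_general: bool,
}

fn choose_encoder(req: &Request) -> Encoder {
    let mut encoder = Encoder::Plain;

    // Analogue of "step 3": pick `fsst` when the data qualifies.
    let nesting_ok = req.general.is_none() || req.allow_fsst_under_general;
    if req.qualifies_for_fsst && nesting_ok {
        encoder = Encoder::Fsst(Box::new(encoder));
    }

    // Analogue of "step 4": layer the general-purpose compressor on top.
    if let Some(scheme) = req.general {
        encoder = Encoder::General(scheme, Box::new(encoder));
    }

    encoder
}
```

With `allow_fsst_under_general` set to true, a ZSTD request on qualifying data yields `General(Zstd, Fsst(Plain))`, which is the nesting described in this post. Setting it to false yields `General(Zstd, Plain)`, so the four choices would behave as parallel alternatives.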