feat: add classic alp encoding support #5587
base: main
Conversation
Code Review for Classic ALP Encoding Support

Summary

This PR adds classic ALP (Adaptive Lossless floating-Point) encoding for f32/f64 data as a miniblock compressor. The implementation includes exponent selection, delta-based bitpacking, and exception handling for non-encodable values.

P1 Issues

1. Potential integer overflow. The "sweet spot" rounding technique can produce incorrect results for values outside the representable integer range:

   ```rust
   let rounded = fast_round_f32(scaled);
   let enc = rounded as i32; // Can overflow if scaled is outside i32 range
   ```

   While you check …

2. Missing feature gate consistency. The …

3. Exception position overflow risk. Exception positions are stored as u16:

   ```rust
   exception_positions.push(i as u16); // Safe only if i < 65536
   ```

   Consider adding a debug_assert or making the assumption explicit in the code. (A rough sketch of guards for issues 1 and 3 follows this comment.)

Suggestions (Non-blocking)

Overall this is a well-documented implementation with proper fallback handling and bitwise-lossless guarantees.
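To make issues 1 and 3 concrete, here is a minimal sketch of the kind of guards being asked for. It substitutes `.round()` for the PR's `fast_round_f32`, and every name outside the quoted snippets is an illustrative assumption, not the PR's actual code:

```rust
/// Illustrative f32 encode loop with explicit overflow and position guards.
/// Returns the encoded integers plus the positions of exception values.
fn encode_f32_chunk(values: &[f32], factor: f32, inv_factor: f32) -> (Vec<i32>, Vec<u16>) {
    // Issue 3: positions are stored as u16, so the chunk length must fit.
    debug_assert!(
        values.len() <= u16::MAX as usize + 1,
        "miniblock chunk too large for u16 exception positions"
    );

    let mut encoded = Vec::with_capacity(values.len());
    let mut exception_positions = Vec::new();
    for (i, &v) in values.iter().enumerate() {
        let scaled = v * factor;
        let rounded = scaled.round();
        // Issue 1: guard the float -> i32 cast. Out-of-range values (and any
        // value that does not round-trip bit-exactly) become exceptions
        // instead of silently wrong integers.
        let in_range = rounded >= i32::MIN as f32 && rounded <= i32::MAX as f32;
        let enc = rounded as i32; // `as` saturates in Rust, but we only trust it when in_range
        if in_range && ((enc as f32) * inv_factor).to_bits() == v.to_bits() {
            encoded.push(enc);
        } else {
            encoded.push(0); // placeholder; the real value is restored from the exceptions
            exception_positions.push(i as u16);
        }
    }
    (encoded, exception_positions)
}
```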
wjones127 left a comment
This is exciting!
Have you benchmarked the throughput of this and done any profiling? The sampling of e and f seems pretty heavy, so maybe worth spending some time optimizing.
Also, could you run fuzz_test.rs on this, and run it for a longer period to check correctness? I think that's just running against 2.1 right now.
Also, I'm not sure it's from this PR, but `cargo check -p lance-encoding --no-default-features` does currently fail due to some bitpacking code not being gated properly.
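On the sampling cost mentioned above: classic ALP picks exponents (e, f) by trying candidate pairs on a sample of values and scoring each pair, which is inherently a nested search. A minimal sketch of that idea, with a simplified score (exception count only) and illustrative names that are not the PR's code:

```rust
const F64_POW10: [f64; 19] = [
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9,
    1e10, 1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18,
];

/// Pick the (e, f) pair that produces the fewest exceptions on the sample.
fn choose_exponents(sample: &[f64]) -> (usize, usize) {
    let mut best = (0, 0);
    let mut best_exceptions = usize::MAX;
    for e in 0..F64_POW10.len() {
        for f in 0..=e {
            // An exception is any value that does not round-trip exactly
            // under round(v * 10^e / 10^f) followed by the inverse scaling.
            let exceptions = sample
                .iter()
                .filter(|&&v| {
                    let enc = (v * F64_POW10[e] / F64_POW10[f]).round();
                    (enc * F64_POW10[f] / F64_POW10[e]).to_bits() != v.to_bits()
                })
                .count();
            if exceptions < best_exceptions {
                best_exceptions = exceptions;
                best = (e, f);
            }
        }
    }
    best
}
```

Even this simplified version is roughly 190 candidate pairs, i.e. a couple of hundred passes over the sample; a production selector would typically also weigh the bit width of the resulting integers, so keeping the sample small matters for throughput.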
```rust
if num_values <= sample_size {
    return (0..num_values).collect();
}
```
I wonder if you should have some value like `None` or an empty `Vec` that is interpreted as "sample all rows"; that way you can avoid the allocation in this common case.
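A minimal sketch of that idea, using an illustrative helper name (`sample_indices`) and `u64` indices that are assumptions rather than the PR's actual signature:

```rust
/// `None` means "use every row", so the common small-chunk case
/// performs no allocation at all.
fn sample_indices(num_values: u64, sample_size: u64) -> Option<Vec<u64>> {
    if num_values <= sample_size {
        return None;
    }
    let step = num_values / sample_size;
    Some((0..sample_size).map(|i| i * step).collect())
}

/// Callers branch once instead of materializing 0..num_values.
fn for_each_sampled(num_values: u64, sample_size: u64, mut visit: impl FnMut(u64)) {
    match sample_indices(num_values, sample_size) {
        None => (0..num_values).for_each(&mut visit),
        Some(indices) => indices.into_iter().for_each(&mut visit),
    }
}
```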
```rust
if data.len() != 4 {
    return Err(Error::invalid_input(
        format!("ALP decompression expects 4 buffers, got {}", data.len()),
        location!(),
    ));
}

let n = usize::try_from(num_values)
    .map_err(|_| Error::invalid_input("ALP chunk too large for usize", location!()))?;

let mut iter = data.into_iter();
let packed_deltas = iter.next().unwrap();
let header = iter.next().unwrap();
let exception_positions = iter.next().unwrap();
let exception_bits = iter.next().unwrap();
```
nit: you should be able to do the length check and unpack the array in one go. The upshot of this is that the compiler doesn't need to put a panic message in the binary for each call to `.next().unwrap()`.
Suggested change:

```diff
-if data.len() != 4 {
-    return Err(Error::invalid_input(
-        format!("ALP decompression expects 4 buffers, got {}", data.len()),
-        location!(),
-    ));
-}
-let n = usize::try_from(num_values)
-    .map_err(|_| Error::invalid_input("ALP chunk too large for usize", location!()))?;
-let mut iter = data.into_iter();
-let packed_deltas = iter.next().unwrap();
-let header = iter.next().unwrap();
-let exception_positions = iter.next().unwrap();
-let exception_bits = iter.next().unwrap();
+// `Vec::try_into` returns the original Vec on failure, so its length is
+// still available for the error message.
+let [packed_deltas, header, exception_positions, exception_bits] =
+    data.try_into().map_err(|data| {
+        Error::invalid_input(
+            format!("ALP decompression expects 4 buffers, got {}", data.len()),
+            location!(),
+        )
+    })?;
+let n = usize::try_from(num_values)
+    .map_err(|_| Error::invalid_input("ALP chunk too large for usize", location!()))?;
```
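For orientation on what those four buffers feed into, here is a rough sketch of the final decode step; the assumption that `exception_bits` holds the original f32 bit patterns is mine for illustration, not confirmed by the PR:

```rust
/// Hypothetical final decode step: values rebuilt from the packed deltas are
/// patched with the exact bit patterns of the values that failed to encode,
/// keeping the codec bitwise lossless.
fn patch_exceptions_f32(
    decoded: &mut [f32],
    exception_positions: &[u16],
    exception_bits: &[u32],
) {
    for (&pos, &bits) in exception_positions.iter().zip(exception_bits) {
        decoded[pos as usize] = f32::from_bits(bits);
    }
}
```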
This PR adds classic ALP encoding support.
Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.