
Conversation

@Xuanwo (Collaborator) commented Dec 29, 2025

This PR adds classic ALP encoding support.


Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

@github-actions bot added the enhancement label Dec 29, 2025
@github-actions (Contributor) commented

Code Review for Classic ALP Encoding Support

Summary

This PR adds classic ALP (Adaptive Lossless floating-Point) encoding for f32/f64 data as a miniblock compressor. The implementation includes exponent selection, delta-based bitpacking, and exception handling for non-encodable values.
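For readers unfamiliar with the scheme, here is a minimal sketch of the classic ALP round-trip for a single f64, assuming the exponent pair (e, f) has already been chosen by sampling. All names are illustrative, not taken from the PR; the real implementation also bitpacks the resulting integers as deltas.

```rust
// Minimal sketch of the classic ALP round-trip for one f64 value, assuming
// the exponent pair (e, f) was already selected by sampling. Names here are
// illustrative and not from the PR.
const POW10: [f64; 4] = [1.0, 10.0, 100.0, 1000.0];

fn alp_encode_f64(value: f64, e: usize, f: usize) -> Option<i64> {
    // Scale by 10^e / 10^f and round to the nearest integer.
    let scaled = value * POW10[e] / POW10[f];
    if !scaled.is_finite() || scaled.abs() >= i64::MAX as f64 {
        return None; // handled as an exception value
    }
    let enc = scaled.round() as i64;
    // Keep the encoding only if decoding reproduces the exact bit pattern;
    // otherwise the value becomes an exception (bitwise-lossless guarantee).
    let decoded = (enc as f64) * POW10[f] / POW10[e];
    (decoded.to_bits() == value.to_bits()).then_some(enc)
}
```

Values that fail either check are stored verbatim in the exception buffers, which is what makes the encoding lossless for arbitrary inputs.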

P1 Issues

1. Potential integer overflow in fast_round_f32/fast_round_f64 transformations

The "sweet spot" rounding technique can produce incorrect results for values outside the representable integer range:

let rounded = fast_round_f32(scaled);
let enc = rounded as i32;  // Can overflow if scaled is outside i32 range

While you check is_finite() and the scaling factor validity, values like f32::MAX * 10^e can be finite after scaling yet fall outside the i32::MIN..=i32::MAX range. In Rust the out-of-range `as` cast then saturates silently, so the encoding would be wrong without any error. Consider adding explicit range checks before the cast, or using try_from semantics.
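One way to make the range check explicit, sketched with a hypothetical helper name (`scaled` mirrors the intermediate from the review comment, not the PR's actual code):

```rust
// Sketch of an explicit range check before the i32 cast. 2^31 is exactly
// representable in f32 while i32::MAX (2^31 - 1) is not, so the upper bound
// is exclusive at 2^31; i32::MIN (-2^31) is exactly representable, so the
// lower bound is inclusive.
fn checked_round_to_i32(scaled: f32) -> Option<i32> {
    if !scaled.is_finite() || scaled >= 2_147_483_648.0 || scaled < -2_147_483_648.0 {
        return None;
    }
    Some(scaled.round() as i32)
}
```

Returning None lets the caller route the value through the existing exception path instead of silently encoding a wrong integer.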

2. Missing feature gate consistency

The AlpMiniBlockEncoder struct and its MiniBlockCompressor impl are not gated by #[cfg(feature = "bitpacking")], but they call encode_chunk_f32/encode_chunk_f64 which return errors when bitpacking is disabled. This creates a runtime error rather than a compile-time check. Consider either:

  • Gating the entire encoder behind the feature flag, or
  • Documenting this behavior explicitly

3. Exception position overflow risk

Exception positions are stored as u16, but chunk sizes are validated only by max_chunk_size() (1024 for f32, 512 for f64). While this fits in u16, there's no explicit assertion that chunk_values <= u16::MAX at the position recording site:

exception_positions.push(i as u16);  // Safe only if i < 65536

Consider adding a debug_assert or making the assumption explicit in the code.
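Making the assumption explicit could look like the following; `record_exception` is a hypothetical helper for illustration, not code from the PR:

```rust
// Sketch of making the "fits in u16" assumption explicit at the
// position-recording site. `record_exception` is a hypothetical helper.
fn record_exception(exception_positions: &mut Vec<u16>, i: usize) {
    debug_assert!(
        i <= u16::MAX as usize,
        "chunk index {i} exceeds u16; max_chunk_size() must stay <= 65536"
    );
    exception_positions.push(i as u16);
}
```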

Suggestions (Non-blocking)

  • The duplicated logic between the f32 and f64 encoding paths is significant (roughly 200+ lines). A generic implementation with trait bounds could reduce the maintenance burden, though the duplication may be intentional for performance.

  • Consider adding a benchmark test for ALP encoding to validate compression ratios and performance characteristics on representative datasets.
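The generics suggestion above might be sketched roughly as follows; all names are illustrative and the rounding details are simplified relative to the PR:

```rust
// Hypothetical sketch of collapsing the f32/f64 paths behind a trait.
trait AlpFloat: Copy {
    type Int;
    /// Scale by 10^e / 10^f and round, or None if the value is not encodable.
    fn to_scaled_int(self, e: i32, f: i32) -> Option<Self::Int>;
}

impl AlpFloat for f64 {
    type Int = i64;
    fn to_scaled_int(self, e: i32, f: i32) -> Option<i64> {
        let scaled = self * 10f64.powi(e) / 10f64.powi(f);
        if !scaled.is_finite() || scaled.abs() >= i64::MAX as f64 {
            return None;
        }
        Some(scaled.round() as i64)
    }
}

impl AlpFloat for f32 {
    type Int = i32;
    fn to_scaled_int(self, e: i32, f: i32) -> Option<i32> {
        let scaled = self * 10f32.powi(e) / 10f32.powi(f);
        if !scaled.is_finite() || scaled >= 2_147_483_648.0 || scaled < -2_147_483_648.0 {
            return None;
        }
        Some(scaled.round() as i32)
    }
}

// A single generic loop then serves both widths, e.g. counting exceptions:
fn count_exceptions<T: AlpFloat>(values: &[T], e: i32, f: i32) -> usize {
    values.iter().filter(|v| v.to_scaled_int(e, f).is_none()).count()
}
```

Whether the compiler monomorphizes this to the same code as the hand-duplicated paths would need to be confirmed by benchmarking.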

Overall this is a well-documented implementation with proper fallback handling and bitwise-lossless guarantees.

@wjones127 (Contributor) left a comment


This is exciting!

Have you benchmarked the throughput of this and done any profiling? The sampling of e and f seems pretty heavy, so maybe worth spending some time optimizing.

Also, could you run the fuzz_test.rs on this, and do it for a longer period to check correctness? I think that's just running against 2.1 right now.

Also, I'm not sure it's from this PR, but cargo check -p lance-encoding --no-default-features does currently fail due to some bitpacking code not being gated properly.

Comment on lines +119 to +121
if num_values <= sample_size {
    return (0..num_values).collect();
}

I wonder if you should have some value like None or an empty Vec that is interpreted as "sample all rows"; that way you can avoid the allocation in this common case.
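The suggestion could look roughly like this; the function name and the even-spacing strategy are illustrative, not from the PR:

```rust
// Sketch: return None to mean "sample every row" so the common small-input
// case skips allocating the 0..num_values index vector.
fn sample_indices(num_values: usize, sample_size: usize) -> Option<Vec<usize>> {
    if num_values <= sample_size {
        // Callers treat None as "use all rows" without materializing indices.
        return None;
    }
    // One simple strategy: evenly spaced positions across the input.
    let step = num_values / sample_size;
    Some((0..sample_size).map(|k| k * step).collect())
}
```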

Comment on lines +921 to +935
if data.len() != 4 {
    return Err(Error::invalid_input(
        format!("ALP decompression expects 4 buffers, got {}", data.len()),
        location!(),
    ));
}

let n = usize::try_from(num_values)
    .map_err(|_| Error::invalid_input("ALP chunk too large for usize", location!()))?;

let mut iter = data.into_iter();
let packed_deltas = iter.next().unwrap();
let header = iter.next().unwrap();
let exception_positions = iter.next().unwrap();
let exception_bits = iter.next().unwrap();

nit: you should be able to do the length check and unpack the array in one go. The upshot of this is that the compiler doesn't need to put a panic message in the binary for each call to .next().unwrap().

Suggested change

Replace:

    if data.len() != 4 {
        return Err(Error::invalid_input(
            format!("ALP decompression expects 4 buffers, got {}", data.len()),
            location!(),
        ));
    }
    let n = usize::try_from(num_values)
        .map_err(|_| Error::invalid_input("ALP chunk too large for usize", location!()))?;
    let mut iter = data.into_iter();
    let packed_deltas = iter.next().unwrap();
    let header = iter.next().unwrap();
    let exception_positions = iter.next().unwrap();
    let exception_bits = iter.next().unwrap();

with:

    let [packed_deltas, header, exception_positions, exception_bits] =
        data.try_into().map_err(|bufs: Vec<_>| {
            Error::invalid_input(
                format!("ALP decompression expects 4 buffers, got {}", bufs.len()),
                location!(),
            )
        })?;
    let n = usize::try_from(num_values)
        .map_err(|_| Error::invalid_input("ALP chunk too large for usize", location!()))?;
