Skip to content

feat(puffin): add format constants, utilities, and JSON serialization#603

Open
zhaoxuan1994 wants to merge 1 commit intoapache:mainfrom
zhaoxuan1994:feat/puffin-format-utils
Open

feat(puffin): add format constants, utilities, and JSON serialization#603
zhaoxuan1994 wants to merge 1 commit intoapache:mainfrom
zhaoxuan1994:feat/puffin-format-utils

Conversation

@zhaoxuan1994
Copy link
Contributor

Second PR in the Puffin series (following #588).

puffin_format.h/.cc: format constants, footer flag operations, little-endian I/O, compress/decompress dispatch (LZ4/Zstd stubbed as NotSupported)
puffin_json_internal.h/.cc: JSON serialization for BlobMetadata and FileMetadata

Copilot AI review requested due to automatic review settings March 24, 2026 07:06
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds core Puffin file-format helpers and JSON (de)serialization support for Puffin metadata, plus unit tests and build-system wiring to compile/run the new components.

Changes:

  • Introduce puffin_format constants/utilities (footer flags, little-endian I/O, compression dispatch stubs).
  • Add puffin_json_internal JSON serialization/deserialization for BlobMetadata and FileMetadata.
  • Add puffin_test and register it in both CMake and Meson test targets.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/iceberg/puffin/puffin_format.h Declares Puffin format constants and helper APIs (flags, endian I/O, compress/decompress).
src/iceberg/puffin/puffin_format.cc Implements flags/endian helpers and compression dispatch stubs.
src/iceberg/puffin/puffin_json_internal.h Declares JSON (de)serialization APIs for Puffin metadata.
src/iceberg/puffin/puffin_json_internal.cc Implements JSON (de)serialization and string parsing for FileMetadata.
src/iceberg/test/puffin_test.cc Adds unit tests for endian/flags/compression stubs and JSON round-trips.
src/iceberg/test/CMakeLists.txt Registers puffin_test in CMake-based test build.
src/iceberg/test/meson.build Registers puffin_test in Meson-based test build.
src/iceberg/meson.build Adds new Puffin .cc sources to the main Meson library build.
src/iceberg/CMakeLists.txt Adds new Puffin .cc sources to the main CMake library build.
src/iceberg/puffin/meson.build Updates Puffin header installation list.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


/// Default compression codec for footer payload.
static constexpr PuffinCompressionCodec kFooterCompressionCodec =
PuffinCompressionCodec::kLz4;
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PuffinFormat::kFooterCompressionCodec is set to kLz4, but Compress/Decompress currently return NotSupported for LZ4. This constant is likely to be used as the default when writing/reading footer payloads, and will cause immediate failures once that code lands. Consider defaulting to kNone until LZ4 support is implemented, or clearly documenting that the default codec is not yet supported and must not be used for writing.

Suggested change
PuffinCompressionCodec::kLz4;
PuffinCompressionCodec::kNone;

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping kLz4 as it reflects the Puffin spec's default footer compression codec, consistent with the Java implementation. This constant is a spec-level declaration — actual usage in the writer will check the Result from Compress(), so unsupported codecs surface as explicit errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is actually a real compatibility concern where java puffin readers would likely break if a compressed footer was ever properly written. Java has never implemented lz4 compression: https://github.com/apache/iceberg/blob/8a51a685894eace51474f512b46a187566f27202/core/src/main/java/org/apache/iceberg/puffin/PuffinFormat.java#L112 which IIUC means anything written with a compressed footer would not be readable from Java. If you agree with this assessment it might be misreading the code but might be worth a discussion on the mailing list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking more closely it looks like java would just throw unsupported codec for compressed footers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the spec calls for lz4, so there is maybe nothing to be done here

@zhaoxuan1994 zhaoxuan1994 force-pushed the feat/puffin-format-utils branch 2 times, most recently from 18a4dcd to 39d4a91 Compare March 24, 2026 07:33
ICEBERG_EXPORT void WriteInt32LittleEndian(int32_t value, std::span<uint8_t, 4> output);

/// \brief Read a 32-bit integer from a fixed-size span in little-endian format.
ICEBERG_EXPORT int32_t ReadInt32LittleEndian(std::span<const uint8_t, 4> input);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is very code for this in the roaring bitmap code it should probably standardized in a common place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added WriteLittleEndian/ReadLittleEndian templates to endian.h as public API. Updated both puffin_format.cc and roaring_position_bitmap.cc to use them, removing the duplicated local helpers.


/// \brief Decompress data using the specified codec.
ICEBERG_EXPORT Result<std::vector<uint8_t>> Decompress(PuffinCompressionCodec codec,
std::span<const uint8_t> input);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roaring bitmap is using string_view as collection of bytes, we should probably standardize on one of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used span here since it feels more natural for binary data. But I can see the argument for string_view given that CRoaring already uses std::string for serialization. Do you have a preference on which one the project should standardize on?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't. Another things to decide is if we are using C++17 or above if std::byte should be used instead. This is probably something that should be discussed on the mailing list I would guess, or at least get @wgtmac and other committers opinion on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're on C++23 so std::byte is definitely available. Together with my previous comment for a generic codec interface/implementation, we can use std::span<const std::byte> as input and std::vector<std::byte> as output but with overloads that accepting std::string_view and returning std::string, etc.


/// \brief Deserialize a FileMetadata from a JSON string.
ICEBERG_EXPORT Result<FileMetadata> FileMetadataFromJsonString(
const std::string& json_string);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: std::string_view?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like tests should be seperated per .cc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@zhaoxuan1994 zhaoxuan1994 force-pushed the feat/puffin-format-utils branch from 39d4a91 to 2ec6aa5 Compare March 26, 2026 07:01
@zhaoxuan1994 zhaoxuan1994 force-pushed the feat/puffin-format-utils branch from 2ec6aa5 to abc7aa8 Compare March 26, 2026 07:23
partition_summary.cc
puffin/file_metadata.cc
puffin/puffin_format.cc
puffin/puffin_json_internal.cc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename to puffin/json_serde_internal.h and puffin/json_serde.cc respectively for consistency.

/// \brief Puffin file format constants.
struct ICEBERG_EXPORT PuffinFormat {
/// Magic bytes: "PFA1" (Puffin Fratercula arctica, version 1)
static constexpr std::array<uint8_t, 4> kMagic = {0x50, 0x46, 0x41, 0x31};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static constexpr std::array<uint8_t, 4> kMagic = {0x50, 0x46, 0x41, 0x31};
static constexpr std::array<uint8_t, 4> kMagicV1 = {0x50, 0x46, 0x41, 0x31};

Let's make V1 explicit.

static constexpr int32_t kFooterStructLength = kFooterStructMagicOffset + kMagicLength;

/// Default compression codec for footer payload.
static constexpr PuffinCompressionCodec kFooterCompressionCodec =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static constexpr PuffinCompressionCodec kFooterCompressionCodec =
static constexpr PuffinCompressionCodec kDefaultFooterCompressionCodec =

Comment on lines +128 to +130
add_iceberg_test(puffin_format_test SOURCES puffin_format_test.cc)

add_iceberg_test(puffin_json_test SOURCES puffin_json_test.cc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge them into a single puffin_test executable?

int32_t offset);

/// \brief Compress data using the specified codec.
ICEBERG_EXPORT Result<std::vector<uint8_t>> Compress(PuffinCompressionCodec codec,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the right place to put compression logic here as this can be shared by TableMetadata compression as well. We can consider a single place to unify the codec implementation (as well as the codec availability).

}

int32_t ReadInt32LittleEndian(std::span<const uint8_t> data, int32_t offset) {
ICEBERG_DCHECK(offset >= 0, "Offset must be non-negative");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ICEBERG_DCHECK is a debug check and should only be used in a rare case. If this is a generic function and cannot guarantee the input, we should return Result<int32_t> and replace it by ICEBERG_PRECHECK here.


TEST(PuffinFormatTest, ByteOrderRoundTrip) {
std::array<uint8_t, 4> buf{};
WriteInt32LittleEndian(0x12345678, buf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these functions be moved to endian.h as they are not puffin specific?

namespace iceberg::puffin {

TEST(PuffinJsonTest, BlobMetadataRoundTrip) {
BlobMetadata blob;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use aggregate initialization.

Comment on lines +55 to +56
ASSERT_TRUE(result.has_value());
EXPECT_EQ(result.value(), blob);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: merge these two checks into ASSERT_THAT


namespace iceberg::puffin {

TEST(PuffinJsonTest, BlobMetadataRoundTrip) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can improve these test cases by refactoring them into parameterized test with more valid and invalid cases. A lot of other test files have done similar jobs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants