Spark: Support writing shredded variant in Iceberg-Spark by aihuaxu · Pull Request #14297 · apache/iceberg

aihuaxu · 2025-10-11T21:02:14Z

This change adds support for writing shredded variants in the iceberg-spark module, enabling Spark to write shredded variant data into Iceberg tables.

Ideally, this should follow the approach described in the reader/writer API proposal for Iceberg V4, where the execution engine provides the shredded writer schema before creating the Iceberg writer. This design is cleaner, as it delegates schema generation responsibility to the engine.

As an interim solution, this PR implements a writer with lazy initialization for the actual Parquet writer. It buffers a portion of the data first, derives the shredded schema from the buffered records, then initializes the Parquet writer and flushes the buffered data to the file.

The current shredding algorithm is to shred to the most common type for a field.

aihuaxu · 2025-10-15T01:57:15Z

@amogh-jahagirdar @Fokko @huaxingao Can you help take a look at this PR and if we have better approach for this?

aihuaxu · 2025-10-21T04:59:31Z

cc @RussellSpitzer, @pvary and @rdblue Seems it's better to have the implementation with new File Format proposal but want to check if this is acceptable approach as an interim solution or you see a better alternative.

parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java

pvary · 2025-10-21T10:01:30Z

@aihuaxu: Don't we want to do the same but instead of wrapping the ParquetWriter, we could wrap the DataWriter. The schema would be created near the SparkWrite.WriterFactory and it would be easier to move to the new API when it is ready. The added benefit would be that when other formats implement the Variant, we could reuse the code.

Would this be prohibitively complex?

huaxingao · 2025-10-21T18:32:50Z

In Spark DSv2, planning/validation happens on the driver. BatchWrite#createBatchWriterFactory runs on the driver and returns a DataWriterFactory that is serialized to executors. That factory must already carry the write schema the executors will use when they create DataWriters.

For shredded variant, we don’t know the shredded schema at planning time. We have to inspect some records to derive it. Doing a read on the driver during createBatchWriterFactory would mean starting a second job inside planning, which is not how DSv2 is intended to work.

Because of that, the current proposed Spark approach is: put the logical variant in the writer factory, on the executor, buffer the first N rows, infer the shredded schema from data, then initialize the concrete writer and flush the buffer. I believe this PR follow the same approach, which seems like a practical solution to me given DSV2's constraints.

pvary · 2025-10-22T08:47:11Z

Thanks for the explanation, @huaxingao! I see several possible workarounds for the DataWriterFactory serialization issue, but I have some more fundamental concerns about the overall approach.
I believe shredding should be driven by future reader requirements rather than by the actual data being written. Ideally, it should remain relatively stable across data files within the same table and originate from a writer job configuration—or even better, from a table-level configuration.

Even if we accept that the written data should dictate the shredding logic, Spark’s implementation—while dependent on input order—is at least somewhat stable. It drops rarely used fields, handles inconsistent types, and limits the number of columns.
I understand this is only a PoC implementation for shredding, but I’m concerned that the current simplifications make it very unstable. If I’m interpreting correctly, the logic infers the type from the first occurrence of each field and creates a column for every field. This could lead to highly inconsistent column layouts within a table, especially in IoT scenarios where multiple sensors produce vastly different data.
Did I miss anything?

aihuaxu · 2025-10-24T16:28:26Z

Thanks @huaxingao and @pvary for reviewing, and thanks to Huaxin for explaining how the writer works in Spark.

Regarding the concern about unstable schemas, Spark's approach makes sense:

If a field appears consistently with a consistent type, create both value and typed_value
If a field appears with inconsistent types, create only value
Drop fields that occur in less than 10% of sampled rows
Cap the total at 300 fields (counting value and typed_value separately)

We could implement similar heuristics. Additionally, making the shredded schema configurable would allow users to choose which fields to shred at write time based on their read patterns.

For this POC, I'd like any feedback on whether there are any significant high-level design options to consider first and if this approach is acceptable. This seems hacky. I may have missed big picture on how the writers work across Spark + Iceberg + Parquet and we may have better way.

github-actions · 2025-11-24T00:19:23Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

Tishj · 2025-11-30T21:28:05Z

This PR caught my eye, as I've implemented the equivalent in DuckDB: duckdb/duckdb#19336

The PR description doesn't give much away, but I think the approach is similar to the proposed (interim) solution here: buffer the first rowgroup, infer the shredded schema from this, then finalize the file schema and start writing data.

We've opted to create a typed_value even though the type isn't 100% consistent within the buffered data, as long as it's the most common. I think you're losing potential compression by not doing that.

We've also added a copy option to force the shredded schema, for debugging purposes and for power users.

As for DECIMAL, it's kind of a special case in the shredding inference. We only shred on a DECIMAL type if all the decimal values we've seen for a column/field have the same width+scale, if any decimal value differs, DECIMAL won't be considered anymore when determining the shredded type of the column/field

yguy-ryft · 2025-12-24T17:43:26Z

This PR is super exciting!
Does this rely on variant shredding support in Spark? Is it supported in Spark 4.1 already, or planned for future releases?

Regarding the heuristics - I'd like to propose adding table properties as hints for variant shredding.
Similarly to properties used for bloom filters, it could be good to introduce something like write.parquet.variant-shredding-enabled.column.col1, which will hint to the writer that this column is important for shredding.
Many variants have important fields for which shredding should be enforced, and other fields which are less central and can be managed with simpler heuristics.
Would love to hear your thoughts!

aihuaxu · 2026-01-09T19:55:50Z

This PR caught my eye, as I've implemented the equivalent in DuckDB: duckdb/duckdb#19336

The PR description doesn't give much away, but I think the approach is similar to the proposed (interim) solution here: buffer the first rowgroup, infer the shredded schema from this, then finalize the file schema and start writing data.

That is correct.

We've opted to create a typed_value even though the type isn't 100% consistent within the buffered data, as long as it's the most common. I think you're losing potential compression by not doing that.

I'm still trying to improve the heuristics to use the most common one as shredding type rather than the first one and probably cap the number of shredded fields, etc. but it doesn't need 100% consistent type to be shredded.

We've also added a copy option to force the shredded schema, for debugging purposes and for power users.

Yeah. I think that makes sense for advanced user to determine the shredded schema since they may know the read pattern.

As for DECIMAL, it's kind of a special case in the shredding inference. We only shred on a DECIMAL type if all the decimal values we've seen for a column/field have the same width+scale, if any decimal value differs, DECIMAL won't be considered anymore when determining the shredded type of the column/field

Why is DECIMAL special here? If we determine DECIMAL4 to be shredded type, then we may shred as DECIMAL4 or not shred if they cannot fit in DECIMAL4, right?

aihuaxu · 2026-01-09T19:58:25Z

This PR is super exciting! Does this rely on variant shredding support in Spark? Is it supported in Spark 4.1 already, or planned for future releases?

Regarding the heuristics - I'd like to propose adding table properties as hints for variant shredding. Similarly to properties used for bloom filters, it could be good to introduce something like write.parquet.variant-shredding-enabled.column.col1, which will hint to the writer that this column is important for shredding. Many variants have important fields for which shredding should be enforced, and other fields which are less central and can be managed with simpler heuristics. Would love to hear your thoughts!

Yeah. I'm also thinking of that too. Will address that separately. Basically based on read pattern, the user can specify the shredding schema.

gkpanda4

When processing JSON objects containing null field values (e.g., {"field": null}), the variant shredding creates schema columns for these null fields instead of omitting them entirely. This would cause schema bloat.

Adding a null check in ParquetVariantUtil.java:386 in the object() method should fix it.

aihuaxu · 2026-01-15T19:39:27Z

When processing JSON objects containing null field values (e.g., {"field": null}), the variant shredding creates schema columns for these null fields instead of omitting them entirely. This would cause schema bloat.

Adding a null check in ParquetVariantUtil.java:386 in the object() method should fix it.

I addressed this null value check in VariantShreddingAnalyzer.java instead. If it's NULL, then we will not add the shredded field.

nssalian · 2026-03-13T17:52:34Z

Hi folks, thank you for your patience. Thanks @aihuaxu for doing all the work for this. The feedback and comments from everyone here was really helpful to make the necessary fixes. I made the following changes after merging the main branch.

I managed to wire the shredding writer through WriterFunction API: Added a writeProperties-aware overload to WriterFunction in BaseFormatModel, forwarded collected properties in ParquetFormatModel, and introduced SparkParquetWriterFunction in SparkFormatModels (v4.1 only) to route to the shredding writer when enabled.
Some fixes were needed: I fixed decimal precision, added some null handling, and applied some heuristics limits too. Implemented field pruning with 10% threshold, 300 cap per SPARK-53659, and deterministic tie-breaking via explicit priority maps.
I added some more tests to the TestVariantShredding.java too to check for various behaviors.

Happy to discuss any of the changes. Please have a look.
CC: @pvary @huaxingao @aihuaxu

pvary · 2026-03-13T19:01:50Z

I managed to wire the shredding writer through WriterFunction API: Added a writeProperties-aware overload to WriterFunction in BaseFormatModel, forwarded collected properties in ParquetFormatModel, and introduced SparkParquetWriterFunction in SparkFormatModels (v4.1 only) to route to the shredding writer when enabled.

I didn't have time to review the full PR yet, but I had the same discussion with @Guosmilesmile on Slack, that I don't like the change in the WriterFunction. You should only create the BufferedWriter first. Based on the properties provided to the writer you collect and buffer the data, and then create the real writer once the buffer is full. At this point the writer decided on the fileSchema (Parquet schema), and that should be enough to create the real writer.

I have started with a similar example, but based on @rdblue's comments we removed the properties and opted for the schemas only solution.

nssalian · 2026-03-13T19:21:03Z

Thanks for the context @pvary , that makes sense.
I'll rework the writeProperties overload and collectedProperties changes. The analyzer and heuristics work is unaffected.

To make sure I implement this correctly: the BufferedWriter currently needs properties for two things, whether spark.sql.iceberg.shred-variants is enabled to decide if shredding applies, and the spark.sql.iceberg.variant.inference.buffer-size to know how many rows to buffer before inferring the schema. If WriterFunction stays schema-only, should the BufferedWriter be created at a higher level (e.g., in the WriteBuilder before createWriterFunc is called) and then delegate to the standard WriterFunction once the schema is inferred?
Or is there another pattern you'd recommend?

pvary · 2026-03-16T15:18:06Z

If WriterFunction stays schema-only, should the BufferedWriter be created at a higher level (e.g., in the WriteBuilder before createWriterFunc is called) and then delegate to the standard WriterFunction once the schema is inferred?

This is how I imagine this today:

public class BufferedFileAppender<D> implements FileAppender<D> {
  private final int bufferRowCount;
  private final Function<List<D>, FileAppender<D>> appenderFactory;
  private List<D> buffer;
  private FileAppender<D> delegate;
  private boolean closed = false;

  public BufferedFileAppender(int bufferRowCount, Function<List<D>, FileAppender<D>> appenderFactory) {
    Preconditions.checkArgument(bufferRowCount > 0, "bufferRowCount must be > 0, got %s", bufferRowCount);
    Preconditions.checkNotNull(appenderFactory, "appenderFactory must not be null");
    this.bufferRowCount = bufferRowCount;
    this.appenderFactory = appenderFactory;
    this.buffer = new ArrayList<>(bufferRowCount);
  }
}

…erFactory

nssalian · 2026-03-23T18:32:13Z

Thanks for the feedback @pvary. @aihuaxu , @pvary and I synced offline to discuss how to move this forward. Adding a note here so that it's easy to review. I've made the following changes:

Refactored per @pvary's suggestion to buffer above the writer. Added BufferedFileAppender in iceberg-core that buffers the first N rows, infers the shredded schema, then creates the real writer.
Moved VariantShreddingAnalyzer from Spark to the parquet module as an abstract class for Spark/Flink reuse.
@Guosmilesmile you can eventually reuse this in your PR.
Added Parquet.WriteBuilder.withFileSchema(MessageType) to supply a pre-computed Parquet schema at write time.
Removed WriterLazyInitializable, SparkParquetWriterWithVariantShredding, and the 4-arg WriterFunction since that wasn't the pattern preferred.
Additional tests and added an extra check for precision in decimals.

@huaxingao, @pvary, @aihuaxu, @RussellSpitzer , please review when you have a chance.

pvary · 2026-03-24T12:51:22Z

core/src/main/java/org/apache/iceberg/io/BufferedFileAppender.java

+    this.bufferRowCount = bufferRowCount;
+    this.appenderFactory = appenderFactory;
+    this.copyFunc = copyFunc;
+    this.buffer = Lists.newArrayList();


nit: Could we initialize with the expected size?

Will clean up.

pvary · 2026-03-24T12:55:05Z

core/src/main/java/org/apache/iceberg/io/BufferedFileAppender.java

+    if (delegate != null) {
+      return delegate.splitOffsets();
+    }
+    return null;


nit: newline after block

pvary · 2026-03-24T12:55:38Z

core/src/main/java/org/apache/iceberg/io/BufferedFileAppender.java

+    if (delegate != null) {
+      return delegate.length();
+    }
+    return 0L;


Are we sure about the 0 length?

since there is nothing buffered yet, 0 makes sense. Let me know if you prefer a different response

pvary · 2026-03-24T12:55:42Z

core/src/main/java/org/apache/iceberg/io/BufferedFileAppender.java

+  public long length() {
+    if (delegate != null) {
+      return delegate.length();
+    }


nit: newline after block

pvary · 2026-03-24T13:51:03Z

core/src/test/java/org/apache/iceberg/io/TestBufferedFileAppender.java

+        for (Record row : bufferedRows) {
+          appender.add(row);
+        }
+        return appender;


nit: newline

I figured spotless would catch some of these, but maybe not. I'll fix them in upcoming commits

pvary · 2026-03-24T13:53:51Z

core/src/main/java/org/apache/iceberg/io/BufferedFileAppender.java

+    Preconditions.checkArgument(
+        bufferRowCount > 0, "bufferRowCount must be > 0, got %s", bufferRowCount);
+    Preconditions.checkNotNull(appenderFactory, "appenderFactory must not be null");
+    Preconditions.checkNotNull(copyFunc, "copyFunc must not be null");


How frequent is the need to copy?
Do we need to get this always, or is it worth to have a non-copy consructor?

pvary · 2026-03-24T13:56:07Z

core/src/test/java/org/apache/iceberg/io/TestBufferedFileAppender.java

+        for (Record row : bufferedRows) {
+          appender.add(row);
+        }


Suggested change

for (Record row : bufferedRows) {

appender.add(row);

}

bufferedRows.forEach(appender::add);

pvary · 2026-03-24T14:05:49Z

parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java

+    public WriteBuilder withFileSchema(MessageType newFileSchema) {
+      this.fileSchema = newFileSchema;
+      return this;
+    }


Is there an easy way to encode this to the engineSchema?

Or could we just set the createReaderFunction based on the MessageType?

pvary · 2026-03-24T14:06:40Z

parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java

+    public DataWriteBuilder withFileSchema(MessageType newFileSchema) {
+      appenderBuilder.withFileSchema(newFileSchema);
+      return this;
+    }
+


Why do we expose this?

pvary · 2026-03-24T14:16:15Z

spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SchemaInferenceVisitor.java

+    MessageTypeBuilder builder = Types.buildMessage();
+
+    for (Type field : fields) {
+      if (field != null) {


Do we need these null checks?

pvary · 2026-03-24T14:24:35Z

spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkFileWriterFactory.java

+    Function<List<InternalRow>, FileAppender<InternalRow>> appenderFactory =
+        bufferedRows -> {
+          Preconditions.checkNotNull(bufferedRows, "bufferedRows must not be null");
+          MessageType originalSchema = ParquetSchemaUtil.convert(dataSchema, "table");


This is not the place for things like this.
There should not be any FileFormat related thing here.

ParquetFormatModel should contain this logic as this is parquet specific.
If there are Spark specific parts, that should be a method used to parametrize the ParquetFormatModel, like the WriterBuilderFunction

+1. When shredding is enabled, this skips the FormatModelRegistry / ParquetFormatModel path and constructs the Parquet writer directly, pulling format-specific imports into the format-agnostic SparkFileWriterFactory.

Agree that this is the main part that we can refactor to follow the new FileFormat API.

I'll work on the refactor based on the comments

pvary · 2026-03-24T14:24:50Z

spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkFileWriterFactory.java

+
+  private boolean shouldUseVariantShredding() {
+    // Variant shredding is currently only supported for Parquet files
+    if (dataFileFormat != FileFormat.PARQUET) {


Will clean this

steveloughran · 2026-03-24T19:57:29Z

@nssalian pointed me at this: I've reviewed it as far as I'm safe to. I am doing benchmarks for read performance #15629 which goes alongside this. It'll show when shedded variant performance equals or exceeds that of unshedded.

Right now the results of #https://github.com/apache/iceberg/issues/15628#issuecomment-4120285243 show that is not yet the case, at least in the test setup

steveloughran · 2026-03-24T19:12:16Z

spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/variant/TestVariantShredding.java

+    spark.conf().set(SparkSQLProperties.SHRED_VARIANTS, "true");
+
+    String values =
+        "(1, parse_json('{\"age\": \"25\"}')),"


java17 has that """" multiline string thing which is ideal for json like this

Let me do that

steveloughran · 2026-03-24T19:15:02Z

spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/variant/TestVariantShredding.java

+    MessageType expectedSchema = parquetSchema(address);
+
+    Table table = validationCatalog.loadTable(tableIdent);
+    verifyParquetSchema(table, expectedSchema);


I'd add a test to make sure that row 2 had age of value 30:int, just to make sure that the parser hasn't decided to "be helpful"

steveloughran · 2026-03-24T19:32:30Z

core/src/main/java/org/apache/iceberg/io/BufferedFileAppender.java

+  }
+
+  @Override
+  public void close() throws IOException {


what happens on a create().close() sequence with no data written? it should be a no-op. Is this tested?

I'll add an empty close test

github-actions bot added spark parquet labels Oct 11, 2025

aihuaxu force-pushed the spark-write-iceberg-variant branch from 16b7a09 to dc4f72e Compare October 11, 2025 21:03

aihuaxu marked this pull request as ready for review October 11, 2025 21:15

aihuaxu force-pushed the spark-write-iceberg-variant branch 3 times, most recently from 97851f0 to b87e999 Compare October 13, 2025 16:47

huaxingao reviewed Oct 21, 2025

View reviewed changes

parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java Outdated Show resolved Hide resolved

deniskuzZ mentioned this pull request Oct 31, 2025

HIVE-29287: Iceberg: [V3] Variant Shredding support apache/hive#6152

Merged

github-actions bot added the stale label Nov 24, 2025

github-actions bot removed the stale label Dec 1, 2025

gkpanda4 reviewed Jan 14, 2026

View reviewed changes

aihuaxu force-pushed the spark-write-iceberg-variant branch 2 times, most recently from 2e81d79 to 7e1b608 Compare January 15, 2026 19:35

aihuaxu force-pushed the spark-write-iceberg-variant branch 4 times, most recently from 7c805f6 to 67dbe97 Compare January 15, 2026 22:50

nssalian added 2 commits March 13, 2026 10:39

Wire shredding writer through WriterFunction API

4f00355

Fix decimal issue, null handling, heuristics and adding more tests

12e2e88

github-actions bot added the core label Mar 13, 2026

nssalian added 7 commits March 19, 2026 14:08

Merge remote-tracking branch 'apache/main' into pr-14297

a4d2982

Adding BufferedFileAppender for deferred writer init

24a1b0c

Adding VariantShreddingAnalyzer and withFileSchema support

53c6125

Wiring the variant shredding write path via BufferedFileAppender

50b01a4

Merge remote-tracking branch 'apache/main' into pr-14297

0171f38

Fix checkstyle violations in SchemaInferenceVisitor and SparkFileWrit…

b598b5c

…erFactory

Merge remote-tracking branch 'apache/main' into pr-14297

f7b6d59

pvary reviewed Mar 24, 2026

View reviewed changes

steveloughran mentioned this pull request Mar 24, 2026

Core, Spark: Add JMH benchmarks for Variants #15628

Open

3 tasks

steveloughran reviewed Mar 25, 2026

View reviewed changes

Conversation

aihuaxu commented Oct 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aihuaxu commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aihuaxu commented Oct 21, 2025

Uh oh!

Uh oh!

pvary commented Oct 21, 2025

Uh oh!

huaxingao commented Oct 21, 2025

Uh oh!

pvary commented Oct 22, 2025

Uh oh!

aihuaxu commented Oct 24, 2025

Uh oh!

github-actions bot commented Nov 24, 2025

Uh oh!

Tishj commented Nov 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yguy-ryft commented Dec 24, 2025

Uh oh!

aihuaxu commented Jan 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aihuaxu commented Jan 9, 2026

Uh oh!

gkpanda4 left a comment

Choose a reason for hiding this comment

Uh oh!

aihuaxu commented Jan 15, 2026

Uh oh!

nssalian commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pvary commented Mar 13, 2026

Uh oh!

nssalian commented Mar 13, 2026

Uh oh!

pvary commented Mar 16, 2026

Uh oh!

nssalian commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

aihuaxu commented Oct 11, 2025 •

edited

Loading

aihuaxu commented Oct 15, 2025 •

edited

Loading

Tishj commented Nov 30, 2025 •

edited

Loading

aihuaxu commented Jan 9, 2026 •

edited

Loading

nssalian commented Mar 13, 2026 •

edited

Loading

nssalian commented Mar 23, 2026 •

edited

Loading