Data: Add TCK tests for Metadata Columns in BaseFormatModelTests#15675
Guosmilesmile wants to merge 7 commits into apache:main from
Conversation
Force-pushed 12a56cf to eca8c4d
Force-pushed eca8c4d to 486652f
data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java
(resolved; outdated diff)
    new String[] {FEATURE_FILTER, FEATURE_CASE_SENSITIVE, FEATURE_SPLIT},
    FileFormat.ORC,
-   new String[] {FEATURE_REUSE_CONTAINERS});
+   new String[] {FEATURE_REUSE_CONTAINERS, FEATURE_META_ROW_LINEAGE});
How hard would it be to implement this?
I think it should work. I'll give it a try in the next PR.
Hi Peter, the corresponding PR has been submitted. #15776
    DataGenerator dataGenerator = new DataGenerators.DefaultSchema();
    Schema schema = dataGenerator.schema();
    List<Record> genericRecords = dataGenerator.generateRecords();
    writeGenericRecords(fileFormat, schema, genericRecords);
Could we create rows where ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER are set?
It is a valid scenario that some of the rows have a row_id while for the other rows these are unset.
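A minimal sketch of that scenario, using a hypothetical `LineageRow` holder (the real tests use Iceberg's `GenericRecord` with the reserved metadata columns): even-indexed rows carry explicit lineage values, odd-indexed rows leave them null so the reader has to derive them from file metadata.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a row with optional lineage metadata (hypothetical;
// not the Iceberg Record type used by the actual test).
class LineageRow {
    final long id;
    final Long rowId;                     // null => unset, reader must derive it
    final Long lastUpdatedSequenceNumber; // null => unset

    LineageRow(long id, Long rowId, Long lastUpdatedSequenceNumber) {
        this.id = id;
        this.rowId = rowId;
        this.lastUpdatedSequenceNumber = lastUpdatedSequenceNumber;
    }

    // Generate a mix: even rows have explicit lineage, odd rows leave it unset.
    static List<LineageRow> generate(int n, long firstRowId, long seq) {
        List<LineageRow> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boolean explicit = i % 2 == 0;
            rows.add(new LineageRow(
                i,
                explicit ? firstRowId + i : null,
                explicit ? seq : null));
        }
        return rows;
    }
}
```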
Yes, I will add a UT to cover it.
    PartitionData partitionData = new PartitionData(partitionType);
    partitionData.set(0, "test_col_a");
Do we need this part, or is the partition data read only from the idToConstant?
I think it is necessary. The partition data information is needed for both writing and reading.
Hmm, after thinking about it, if we are testing the read, we don't actually need to inject partition information here, because it is injected through idToConstant. I'll change it to non-partitioned for testing.
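The idToConstant mechanism discussed above can be sketched like this (hypothetical helper names; Iceberg's readers receive such a map and fill constant columns, e.g. partition values, without reading them from the data file):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of constant-column injection during read (hypothetical API shape,
// not Iceberg's actual reader code).
class ConstantInjector {
    // fieldIds: projected field ids in output order
    // fileValues: values actually read from the file, keyed by field id
    // idToConstant: constants injected by the reader, keyed by field id
    static List<Object> project(
            int[] fieldIds, Map<Integer, Object> fileValues, Map<Integer, Object> idToConstant) {
        List<Object> row = new ArrayList<>(fieldIds.length);
        for (int id : fieldIds) {
            // Constants (e.g. partition values) win over file contents.
            if (idToConstant.containsKey(id)) {
                row.add(idToConstant.get(id));
            } else {
                row.add(fileValues.get(id));
            }
        }
        return row;
    }
}
```

This is why a non-partitioned write suffices for the read test: the partition values arrive through the map, not the file.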
    partitionData.set(0, "test_col_a");

    DataWriter<Record> writer =
        FormatModelRegistry.dataWriteBuilder(fileFormat, Record.class, encryptedFile)
Does the writer remove the partition columns? If so, then we need these tests, but this is more like a writer test
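The behavior being asked about can be sketched as follows (hypothetical helper; whether Iceberg's data writer actually strips partition-source columns is exactly what the comment is probing):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: a writer-side step that drops partition-source columns from
// the row before persisting it (assumed behavior, not confirmed Iceberg code).
class PartitionStrippingWriter {
    // positions: indices of partition-source columns to strip from the row
    static List<Object> stripPartitionColumns(List<Object> row, Set<Integer> positions) {
        List<Object> written = new ArrayList<>();
        for (int i = 0; i < row.size(); i++) {
            if (!positions.contains(i)) {
                written.add(row.get(i));
            }
        }
        return written;
    }
}
```

If the writer behaves like this, round-trip tests must re-inject the stripped values on read, which again points at idToConstant.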
data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java
(resolved; outdated diff)
    protected abstract void assertEquals(Schema schema, List<T> expected, List<T> actual);

    protected abstract Object convertConstantToEngine(Types.NestedField field, Object value);
Could we just create a Record and Schema and use convertToEngine instead of this new method?
I have tried it. When I add partitionData to idToConstant, the Flink side requires it to be a Record. The RowDataConverter.convert used in convertToEngine will force the STRUCT to be converted to a Record and throw an error. For metadata processing on the Flink side, RowDataUtil.convertConstant is used instead.
Could this help?
private static RowData convert(Types.StructType struct, StructLike record) {
  GenericRowData rowData = new GenericRowData(struct.fields().size());
  List<Types.NestedField> fields = struct.fields();
  for (int i = 0; i < fields.size(); i += 1) {
    Types.NestedField field = fields.get(i);
    Type fieldType = field.type();
    rowData.setField(i, convert(fieldType, record.get(i, Object.class)));
  }
  return rowData;
}
Notice the StructLike record parameter and the record.get(i, Object.class) access.
I converted StructLike to Record, and then used convertToEngineRecords + assertEquals for comparison, which allowed me to remove the convertConstantToEngine method.
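The shape of that fix can be sketched with stand-in types (hypothetical; the real code converts Iceberg's StructLike into a GenericRecord before calling convertToEngineRecords):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: flatten a positional, StructLike-style struct into plain values so
// a single generic comparison path can consume it (stand-in interface, not
// Iceberg's StructLike).
class StructFlattener {
    interface PositionalStruct {
        int size();
        Object get(int pos);
    }

    static List<Object> toValues(PositionalStruct struct) {
        List<Object> values = new ArrayList<>(struct.size());
        for (int i = 0; i < struct.size(); i++) {
            Object value = struct.get(i);
            // Recurse so nested structs (e.g. partition data) also flatten.
            values.add(value instanceof PositionalStruct
                ? toValues((PositionalStruct) value)
                : value);
        }
        return values;
    }
}
```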
    protected abstract Object convertConstantToEngine(Types.NestedField field, Object value);

    protected abstract <D> List<D> convertToPartitionIdentity(
Could we use an existing method to achieve this?
Along with the aforementioned changes, this part has also been optimized away.
      };
    }

    private Map<Integer, Object> convertConstantsToEngine(
Could we just have the Constants in a GenericRecord and convert it to the engine type?
Sorry, I'm not quite able to get the direction of the adjustment needed here. However, I have made some changes mentioned above, and I'm not sure if further adjustments are needed in this part. I would appreciate more suggestions.
This is too convoluted, and we still need a getFieldFromEngineRow, so we are not much better off.
If we still need the extra method, we might be better off having a method like:
public static Object convertConstantsToEngine(Type type, Object value);
For Spark it could simply call SparkUtil.internalToSpark
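A sketch of that suggested shape, with simulated type tags (hypothetical; a real implementation would dispatch on Iceberg's Type and, for Spark, delegate to SparkUtil.internalToSpark):

```java
// Hypothetical per-type constant conversion mirroring the suggested
// convertConstantsToEngine(Type, Object) signature. The TypeKind enum and
// conversions are stand-ins, not Iceberg or engine APIs.
class EngineConstantConverter {
    enum TypeKind { STRING, LONG, INT }

    static Object convertConstantsToEngine(TypeKind type, Object value) {
        if (value == null) {
            return null;
        }
        switch (type) {
            case STRING:
                return value.toString();             // e.g. String -> engine string type
            case LONG:
                return ((Number) value).longValue(); // normalize numeric widths
            case INT:
                return ((Number) value).intValue();
            default:
                return value;
        }
    }
}
```

One static entry point per (type, value) pair keeps the engine-specific subclass small: each engine overrides only the conversion, not the traversal.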
I found the reason why this part is so complicated. It happens on the Flink side where RowDataConverter.convert used in convertToEngine will force the STRUCT to be converted to a Record. While SparkUtil.internalToSpark has handled this case. Since RowDataConverter is production code, I added handling for PartitionData in TestFlinkFormatModel's convertConstantToEngine.
    @ParameterizedTest
    @FieldSource("FILE_FORMATS")
    void testReadMetadataColumnPartitionBucketTransform(FileFormat fileFormat) throws IOException {
Could you help me with highlighting the differences between this test and testReadMetadataColumnPartitionIdentity?
I took a look, and they are actually pretty much the same, so I removed the bucket part.
    @ParameterizedTest
    @FieldSource("FILE_FORMATS")
    void testReadMetadataColumnPartitionEvolutionAddColumn(FileFormat fileFormat) throws IOException {
Could we have a test with addColumnWithDefaultReadValue?
Added testReaderSchemaEvolutionNewColumnWithDefault, and found that ORC doesn't support it.
iceberg/orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
Lines 407 to 413 in 5dff6f6
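The read-time default semantics that the new test exercises can be sketched like this (hypothetical helper; the idea is that files written before the column was added return the column's initial default rather than null):

```java
import java.util.Map;

// Sketch of addColumnWithDefaultReadValue behavior (assumed semantics, not
// Iceberg's reader code): an old file has no entry for the new field id, so
// the reader substitutes the default.
class DefaultFillingReader {
    static Object readColumn(Map<Integer, Object> fileColumns, int fieldId, Object defaultValue) {
        return fileColumns.containsKey(fieldId) ? fileColumns.get(fieldId) : defaultValue;
    }
}
```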
Force-pushed 53c0dd2 to d0d35a0
This PR adds TCK tests for metadata column reading in BaseFormatModelTests.
Metadata Columns:
Part of #15415