Data: Add TCK tests for Metadata Columns in BaseFormatModelTests#15675
Guosmilesmile wants to merge 7 commits into apache:main from
Conversation
Force-pushed 12a56cf to eca8c4d
Force-pushed eca8c4d to 486652f
data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java
(resolved; outdated diff)
    new String[] {FEATURE_FILTER, FEATURE_CASE_SENSITIVE, FEATURE_SPLIT},
    FileFormat.ORC,
-   new String[] {FEATURE_REUSE_CONTAINERS});
+   new String[] {FEATURE_REUSE_CONTAINERS, FEATURE_META_ROW_LINEAGE});
How hard would it be to implement this?
I think it should work. I'll give it a try in the next PR.
Hi Peter, the corresponding PR has been submitted. #15776
    DataGenerator dataGenerator = new DataGenerators.DefaultSchema();
    Schema schema = dataGenerator.schema();
    List<Record> genericRecords = dataGenerator.generateRecords();
    writeGenericRecords(fileFormat, schema, genericRecords);
Could we create rows where ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER are set?
It is a valid scenario that some of the rows have a row_id while for the other rows these are unset.
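A minimal sketch of that scenario, using a hypothetical `LineageRow` holder (the real tests use Iceberg's `GenericRecord` with the reserved metadata columns): even-indexed rows carry explicit lineage values, odd-indexed rows leave them null so the reader has to derive them from file metadata.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a row with optional lineage metadata (hypothetical;
// not the Iceberg Record type used by the actual test).
class LineageRow {
    final long id;
    final Long rowId;                     // null => unset, reader must derive it
    final Long lastUpdatedSequenceNumber; // null => unset

    LineageRow(long id, Long rowId, Long lastUpdatedSequenceNumber) {
        this.id = id;
        this.rowId = rowId;
        this.lastUpdatedSequenceNumber = lastUpdatedSequenceNumber;
    }

    // Generate a mix: even rows have explicit lineage, odd rows leave it unset.
    static List<LineageRow> generate(int n, long firstRowId, long seq) {
        List<LineageRow> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boolean explicit = i % 2 == 0;
            rows.add(new LineageRow(
                i,
                explicit ? firstRowId + i : null,
                explicit ? seq : null));
        }
        return rows;
    }
}
```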
Yes, I will add a UT to cover it.
    PartitionData partitionData = new PartitionData(partitionType);
    partitionData.set(0, "test_col_a");
Do we need this part, or is the partition data read only from the idToConstant?
I think it is necessary. The partition data information is needed for both writing and reading.
Hmm, after thinking about it, if we are testing the read, we don't actually need to inject partition information here, because it is injected through idToConstant. I'll change it to non-partitioned for testing.
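The idToConstant mechanism discussed above can be sketched like this (hypothetical helper names; Iceberg's readers receive such a map and fill constant columns, e.g. partition values, without reading them from the data file):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of constant-column injection during read (hypothetical API shape,
// not Iceberg's actual reader code).
class ConstantInjector {
    // fieldIds: projected field ids in output order
    // fileValues: values actually read from the file, keyed by field id
    // idToConstant: constants injected by the reader, keyed by field id
    static List<Object> project(
            int[] fieldIds, Map<Integer, Object> fileValues, Map<Integer, Object> idToConstant) {
        List<Object> row = new ArrayList<>(fieldIds.length);
        for (int id : fieldIds) {
            // Constants (e.g. partition values) win over file contents.
            if (idToConstant.containsKey(id)) {
                row.add(idToConstant.get(id));
            } else {
                row.add(fileValues.get(id));
            }
        }
        return row;
    }
}
```

This is why a non-partitioned write suffices for the read test: the partition values arrive through the map, not the file.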
    partitionData.set(0, "test_col_a");

    DataWriter<Record> writer =
        FormatModelRegistry.dataWriteBuilder(fileFormat, Record.class, encryptedFile)
Does the writer remove the partition columns? If so, then we need these tests, but this is more like a writer test
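The behavior being asked about can be sketched as follows (hypothetical helper; whether Iceberg's data writer actually strips partition-source columns is exactly what the comment is probing):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: a writer-side step that drops partition-source columns from
// the row before persisting it (assumed behavior, not confirmed Iceberg code).
class PartitionStrippingWriter {
    // positions: indices of partition-source columns to strip from the row
    static List<Object> stripPartitionColumns(List<Object> row, Set<Integer> positions) {
        List<Object> written = new ArrayList<>();
        for (int i = 0; i < row.size(); i++) {
            if (!positions.contains(i)) {
                written.add(row.get(i));
            }
        }
        return written;
    }
}
```

If the writer behaves like this, round-trip tests must re-inject the stripped values on read, which again points at idToConstant.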
data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java
(resolved; outdated diff)
    protected abstract void assertEquals(Schema schema, List<T> expected, List<T> actual);

    protected abstract Object convertConstantToEngine(Types.NestedField field, Object value);
Could we just create a Record and Schema and use convertToEngine instead of this new method?
I have tried it. When I add partitionData to idToConstant, the Flink side requires it to be a Record. The RowDataConverter.convert used in convertToEngine will force the STRUCT to be converted to a Record and throw an error. For metadata processing on the Flink side, RowDataUtil.convertConstant is used instead.
Could this help?
private static RowData convert(Types.StructType struct, StructLike record) {
  GenericRowData rowData = new GenericRowData(struct.fields().size());
  List<Types.NestedField> fields = struct.fields();
  for (int i = 0; i < fields.size(); i += 1) {
    Types.NestedField field = fields.get(i);
    Type fieldType = field.type();
    rowData.setField(i, convert(fieldType, record.get(i, Object.class)));
  }
  return rowData;
}
Notice the StructLike record parameter and the record.get(i, Object.class) access.
I converted StructLike to Record, and then used convertToEngineRecords + assertEquals for comparison, which allowed me to remove the convertConstantToEngine method.
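The shape of that fix can be sketched with stand-in types (hypothetical; the real code converts Iceberg's StructLike into a GenericRecord before calling convertToEngineRecords):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: flatten a positional, StructLike-style struct into plain values so
// a single generic comparison path can consume it (stand-in interface, not
// Iceberg's StructLike).
class StructFlattener {
    interface PositionalStruct {
        int size();
        Object get(int pos);
    }

    static List<Object> toValues(PositionalStruct struct) {
        List<Object> values = new ArrayList<>(struct.size());
        for (int i = 0; i < struct.size(); i++) {
            Object value = struct.get(i);
            // Recurse so nested structs (e.g. partition data) also flatten.
            values.add(value instanceof PositionalStruct
                ? toValues((PositionalStruct) value)
                : value);
        }
        return values;
    }
}
```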
    protected abstract Object convertConstantToEngine(Types.NestedField field, Object value);

    protected abstract <D> List<D> convertToPartitionIdentity(
Could we use an existing method to achieve this?
Along with the aforementioned changes, this part has also been optimized away.
      };
    }

    private Map<Integer, Object> convertConstantsToEngine(
Could we just have the Constants in a GenericRecord and convert it to the engine type?
Sorry, I'm not quite able to get the direction of the adjustment needed here. However, I have made some changes mentioned above, and I'm not sure if further adjustments are needed in this part. I would appreciate more suggestions.
This is too convoluted, and we still need a getFieldFromEngineRow, so we are not much better off.
If we still need the extra method, we might be better off having a method like:
public static Object convertConstantsToEngine(Type type, Object value);
For Spark it could simply call SparkUtil.internalToSpark
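A sketch of that suggested shape, with simulated type tags (hypothetical; a real implementation would dispatch on Iceberg's Type and, for Spark, delegate to SparkUtil.internalToSpark):

```java
// Hypothetical per-type constant conversion mirroring the suggested
// convertConstantsToEngine(Type, Object) signature. The TypeKind enum and
// conversions are stand-ins, not Iceberg or engine APIs.
class EngineConstantConverter {
    enum TypeKind { STRING, LONG, INT }

    static Object convertConstantsToEngine(TypeKind type, Object value) {
        if (value == null) {
            return null;
        }
        switch (type) {
            case STRING:
                return value.toString();             // e.g. String -> engine string type
            case LONG:
                return ((Number) value).longValue(); // normalize numeric widths
            case INT:
                return ((Number) value).intValue();
            default:
                return value;
        }
    }
}
```

One static entry point per (type, value) pair keeps the engine-specific subclass small: each engine overrides only the conversion, not the traversal.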
I found the reason why this part is so complicated. It happens on the Flink side where RowDataConverter.convert used in convertToEngine will force the STRUCT to be converted to a Record. While SparkUtil.internalToSpark has handled this case. Since RowDataConverter is production code, I added handling for PartitionData in TestFlinkFormatModel's convertConstantToEngine.
    @ParameterizedTest
    @FieldSource("FILE_FORMATS")
    void testReadMetadataColumnPartitionBucketTransform(FileFormat fileFormat) throws IOException {
Could you help me with highlighting the differences between this test and testReadMetadataColumnPartitionIdentity?
I took a look, and they are actually pretty much the same, so I removed the bucket part.
    @ParameterizedTest
    @FieldSource("FILE_FORMATS")
    void testReadMetadataColumnPartitionEvolutionAddColumn(FileFormat fileFormat) throws IOException {
Could we have a test with addColumnWithDefaultReadValue?
Added testReaderSchemaEvolutionNewColumnWithDefault, and found that ORC doesn't support it.
iceberg/orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java
Lines 407 to 413 in 5dff6f6
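The read-time default semantics that the new test exercises can be sketched like this (hypothetical helper; the idea is that files written before the column was added return the column's initial default rather than null):

```java
import java.util.Map;

// Sketch of addColumnWithDefaultReadValue behavior (assumed semantics, not
// Iceberg's reader code): an old file has no entry for the new field id, so
// the reader substitutes the default.
class DefaultFillingReader {
    static Object readColumn(Map<Integer, Object> fileColumns, int fieldId, Object defaultValue) {
        return fileColumns.containsKey(fieldId) ? fileColumns.get(fieldId) : defaultValue;
    }
}
```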
Force-pushed 53c0dd2 to d0d35a0
This PR adds TCK tests for metadata column reading in BaseFormatModelTests.
Metadata Columns:
Part of #15415