ORC: Add _row_id and _last_updated_sequence_number reader in ORC to support lineage #15776
Guosmilesmile wants to merge 2 commits into apache:main
    MetadataColumns.ROW_ID.fieldId(),
    MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.fieldId()));
These should already be in META_IDS.
Yes, META_IDS contains the row ID and the last updated sequence number. The original code removed every metadata field from the file projection, but in the lineage scenario _row_id exists in the data file and should not be removed. Therefore, we use a set difference here to exclude ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER from the IDs that get removed.
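To make the set-difference idea concrete, here is a minimal sketch using only `java.util`. The class name, the numeric field IDs, and the helper methods are hypothetical and chosen for illustration; they are not the real `MetadataColumns` values or the code in this PR.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class LineageProjectionSketch {
  // Hypothetical field IDs for illustration only; not the real MetadataColumns IDs.
  static final int FILE_PATH_ID = 1001;
  static final int ROW_POS_ID = 1002;
  static final int ROW_ID = 1003;
  static final int LAST_UPDATED_SEQ_ID = 1004;

  // All metadata column IDs, including the two row-lineage columns.
  static final Set<Integer> META_IDS =
      Set.of(FILE_PATH_ID, ROW_POS_ID, ROW_ID, LAST_UPDATED_SEQ_ID);

  // Row-lineage columns may be physically stored in the data file.
  static final Set<Integer> LINEAGE_IDS = Set.of(ROW_ID, LAST_UPDATED_SEQ_ID);

  // META_IDS minus LINEAGE_IDS: the IDs that are still stripped from the file projection.
  static Set<Integer> idsToStrip() {
    Set<Integer> result = new LinkedHashSet<>(META_IDS);
    result.removeAll(LINEAGE_IDS);
    return result;
  }

  // Keeps data columns and lineage columns; drops the remaining metadata columns.
  static List<Integer> fileProjection(List<Integer> projectedIds) {
    Set<Integer> strip = idsToStrip();
    return projectedIds.stream()
        .filter(id -> !strip.contains(id))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Data columns 1 and 2, plus three metadata columns.
    System.out.println(fileProjection(List.of(1, 2, FILE_PATH_ID, ROW_ID, LAST_UPDATED_SEQ_ID)));
    // prints [1, 2, 1003, 1004]
  }
}
```

With this shape, the lineage columns stay in the projection pushed down to the file, while purely virtual metadata columns are still filled in by the reader.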
    OrcValueReader<Long> fileIdReader =
        readerIndex < readerList.size()
            ? (OrcValueReader<Long>) readerList.get(readerIndex)
            : null;
Please help me understand why we do this.
My understanding is that readerList represents the physical columns, while ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER may exist only in the logical projection. In my testing the two counts were always consistent, but I cannot guarantee that there is no scenario in which the projection and the physical fields diverge. So when readerIndex is out of range, fileIdReader is set to null and the reader takes the fallback path. This is a defensive check.
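The defensive lookup described above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `ValueReader` is a hypothetical stand-in for `OrcValueReader`, and the fallback value (`firstRowId + pos`) assumes the V3 row-lineage rule that a missing or null `_row_id` is derived from the file's first row ID plus the row position.

```java
import java.util.List;

class RowIdFallbackSketch {
  // Hypothetical stand-in for a per-column value reader; not the real OrcValueReader.
  interface ValueReader<T> {
    T read(long pos);
  }

  // Returns the physical reader at the index, or null if the file has no such column.
  static ValueReader<Long> fileReaderAt(List<ValueReader<?>> readers, int index) {
    @SuppressWarnings("unchecked")
    ValueReader<Long> reader =
        index < readers.size() ? (ValueReader<Long>) readers.get(index) : null;
    return reader;
  }

  // Uses the stored value when present; otherwise derives the ID (assumed rule).
  static long rowId(ValueReader<Long> fileIdReader, long firstRowId, long pos) {
    if (fileIdReader != null) {
      Long stored = fileIdReader.read(pos);
      if (stored != null) {
        return stored;
      }
    }
    return firstRowId + pos;
  }

  public static void main(String[] args) {
    // A file column that stores an explicit ID only for row 0.
    ValueReader<Long> stored = pos -> pos == 0 ? 42L : null;
    List<ValueReader<?>> readers = List.of(stored);

    System.out.println(rowId(fileReaderAt(readers, 0), 100L, 0)); // stored value
    System.out.println(rowId(fileReaderAt(readers, 0), 100L, 1)); // null in file, derived
    System.out.println(rowId(fileReaderAt(readers, 5), 100L, 3)); // no physical column, derived
  }
}
```

The bounds check costs little and keeps both cases (column absent from the file, value null in the file) on the same fallback path.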
Add a unit test for Spark. This PR does not implement lineage for the Spark vectorized read path in ORC; a follow-up PR will add it.
While working on improving the TCK for File Format, we found that in V3 tables, we support lineage in Parquet and Avro, but we haven't implemented this feature in ORC.
This PR adds a reader for _row_id and _last_updated_sequence_number in ORC to support lineage. It does not implement lineage for the Spark vectorized read path in ORC; a follow-up PR will add it.