Skip to content

ORC: Add _row_id and _last_updated_sequence_number raeder in Orc to support lineage#15776

Draft
Guosmilesmile wants to merge 2 commits intoapache:mainfrom
Guosmilesmile:orc_rowid
Draft

ORC: Add _row_id and _last_updated_sequence_number raeder in Orc to support lineage#15776
Guosmilesmile wants to merge 2 commits intoapache:mainfrom
Guosmilesmile:orc_rowid

Conversation

@Guosmilesmile
Copy link
Copy Markdown
Contributor

@Guosmilesmile Guosmilesmile commented Mar 26, 2026

While working on improving the TCK for File Format, we found that in V3 tables, we support lineage in Parquet and Avro, but we haven't implemented this feature in ORC.

This PR aims to add _row_id and _last_updated_sequence_number reader in ORC to support lineage.

This pr don't implemented lineage in spark vector read in ORC, it will support it in the follow pr.

Comment on lines +794 to +795
MetadataColumns.ROW_ID.fieldId(),
MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.fieldId()));
Copy link
Copy Markdown
Contributor

@pvary pvary Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be already in the META_IDS

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, META_IDS contains the row ID and last update sequence. The original code would delete all metadata-related fields, but in the lineage scenario, _row_id exists in the datafile and should not be removed. Therefore, we need to use difference to remove ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER here.

Comment on lines +179 to +182
OrcValueReader<Long> fileIdReader =
readerIndex < readerList.size()
? (OrcValueReader<Long>) readerList.get(readerIndex)
: null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please help me understand why we do this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that readerList represents physical columns, while ROW_ID/LAST_UPDATED_SEQUENCE_NUMBER may only exist in the logical projection. Although in my testing the counts are consistent, I cannot guarantee that there are no other scenarios where the projection and physical fields are inconsistent. So I added fileIdReader == null to fall back to the fallback path, which has a bit of a defensive programming flavor.

@Guosmilesmile Guosmilesmile marked this pull request as draft March 27, 2026 15:01
@github-actions github-actions bot added the spark label Mar 27, 2026
@Guosmilesmile
Copy link
Copy Markdown
Contributor Author

Add ut for spark , this pr don't implemented lineage in spark vector read in ORC, it will support it in the follow pr.

@Guosmilesmile Guosmilesmile marked this pull request as ready for review March 27, 2026 16:30
@Guosmilesmile Guosmilesmile marked this pull request as draft March 27, 2026 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants