Skip to content

[SPARK-55851][PYTHON] Clarify types of datasource partition and read#54635

Open
gaogaotiantian wants to merge 6 commits intoapache:masterfrom
gaogaotiantian:fix-datasource-read-type
Open

[SPARK-55851][PYTHON] Clarify types of datasource partition and read#54635
gaogaotiantian wants to merge 6 commits intoapache:masterfrom
gaogaotiantian:fix-datasource-read-type

Conversation

@gaogaotiantian
Copy link
Contributor

What changes were proposed in this pull request?

Clarify and unify the types of datasource partition/read function.

  • Correct type annotation
  • Change the default behavior of partition to make it match the documentation.

Why are the changes needed?

Our current type hint for partition and read is wrong. We do accept None as a partition in our code but we did not mention it. This will confuse users as this is documented and user facing.

We also says the default behavior for partition is to return a list with None - but we didn't do that. Instead, we used a very convoluted approach - raise an exception, catch that in the worker and convert that to [None]. That's super unnecessary.

This change should not affect users if they already overwrite their partition and read method. (unless they overwrite with a function that raises NotImplementedError).

Overall this gives us a more consistent interface with correct typing.

Does this PR introduce any user-facing change?

Basically no. Unless user does something crazy.

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

No.

@gaogaotiantian
Copy link
Contributor Author

Need approval from @allisonwang-db :)

Copy link
Contributor

@allisonwang-db allisonwang-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

@gaogaotiantian
Copy link
Contributor Author

Okay we need some further discussion on this matter. The problem is streaming. streaming has slightly different logic. Who do we need to confirm about what the expected behavior is for streaming data source? In an ideal world, streaming data source should have the exact same interface as non-streaming one. So partition method should be able to return [None] and read should take it. However, we should confirm that this is expected first. Seems like @siying wrote the test part? Maybe @HeartSaVioR can also shed some light to the matter?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants