-
Notifications
You must be signed in to change notification settings - Fork 25
feat: add partitioned namespace spec #279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
9262dcb to
5b893f0
Compare
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
5b893f0 to
ce8bb5c
Compare
ce8bb5c to
bd6fb89
Compare
|
Hi @jackye1995 , I've updated this pr, could you help review when you have time, thanks very much~ |
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a pass for the spec, I think we can merge this first later after one more round of iteration, and start the implementation in https://github.com/lance-format/lance/tree/main/rust/lance-namespace-impls/src/dir and iterate on that.
1a75710 to
d638d2b
Compare
|
Hi @jackye1995 , thanks your nice suggestions! I have updated this pr and targeted all the comments. There's one thing I'm not quite sure about.
I understand that the current spec is mostly consistent with Iceberg in terms of partition transform, except for the JSON serialization method, for example iceberg uses |
|
@wojiaodoubao I edited the doc with some additional designs, let me know if this looks good to you. |
a26e52a to
6f8c6dd
Compare
|
Try to explain my thinking here:
|
| The partitioning information is stored in `partition_spec_v<N>` (e.g., `partition_spec_v1`), which is a JSON array of partition field objects. Each partition field contains: | ||
|
|
||
| * A **field id** uniquely identifying this partition field | ||
| * A **name** for the partition field (used as the column name in `__manifest`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im' not sure if I understand correctly.
I think name need to be unique because it is a column in __manifest. The spec supports partition evolution, it means we need to prevent user from adding a partition field with existed name when 'adding partition column'.
partition field ID: partition field ID serves as the priority of partition keys, so we can still know the order of the fields even if it is converted to a map or something.
The Field ID represents the sequence. So if I have partition fields country, city, and I want to update it to continent , country, city, what should I do?
- First I need to resolve the field id issue: the filed id of
countrymust be less thancontinent, so iscity. I can do it by adding 3 new partition fields. - Second I'll face name conflicts. Can I reuse
countryandcityas partition field name?
Shall we use {partition_field_id}_{partition_field_name} as the column name in __manifest table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Field ID represents the sequence. So if I have partition fields country, city, and I want to update it to continent , country, city, what should I do?
My thinking is that the sequence only matters because we need to know how to organize the namespace levels (as described in #279 (comment))
so if you add a continent, it will be country -> city -> continent. Technically for use cases like query pruning, partitioned join, I don't think this matters because you can still find "all partitions belong to continent Asia" without problem even though it is the last level not the first.
First I need to resolve the field id issue: the filed id of country must be less than continent, so is city. I can do it by adding 3 new partition fields.
The partition specs are not independent, it should have some relationship between versions. In the example you give, we are adding a new partition field, so country and city field IDs should remain unchanged, and continent has the next field ID.
Second I'll face name conflicts. Can I reuse country and city as partition field name?
that's a good point. I think the safest approach is likely to also use only id to reference the partition fields, so the column names would take partition_field_{i} in the __manifest table.
|
|
||
| 1. Query engine analyzes the query predicate to identify filters on partition columns | ||
| 2. For each partition expression, the engine evaluates the expression with the query values to compute the expected partition value(s) | ||
| 3. Engine queries `__manifest` with filters on the partition columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will DirectoryNamespace providing a method to query __manifest table? Or a method to push down filter and get tables identifiers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we already do, just need to make it pub(crate): https://github.com/lance-format/lance/blob/main/rust/lance-namespace-impls/src/dir/manifest.rs#L535
+1
Does it mean all tables must be consistent at filed ID?
+1
+1 say in the spec that the namespace and table level schema should remain consistent.
+1. For engines that support DataFusion, we can directly use DataFusion to compute partition values; for those that do not, we need to implement functions with the same semantics ourselves, right?
+1
+1. I was thinking, how should we expose it to engine? It seems a bad idea to expose __manifest table directly. So maybe adding a new function to DirectoryNamespace(not Namespace) to perform partition pruning?
Current spec supports schema evolution, partition evolution, and ACID. Does this mean it adds an additional table format layer on top of the Lance Table Format? I wonder if we need to clarify which table format capabilities can be implemented at the partition layer and which will never be implemented there. |
What specific capabilities are you thinking about? In my mind the cut is clear: Lance table format will never offer partitioning, and only do clustering. This has been clear in the previous conversations. Any feature related to hard-partitioning a logical dataset into physical parts go in this layer. Note that we are only talking about the table data. We could still do partitioning for a specific index, if that makes sense for that index type. |
I was thinking capabilities like: branch, tag, time travel, deletion a physical partition(table), compaction, cleanup etc for partitioned namespace. I like current capabilities set: schema evolution, partition evolution, and read_version. I think it might be too much if we add the capabilities above to partitioned namespace.
+1, agree. |
|
|
||
| **Field ID Stability**: Field IDs (`lance:field_id`) are never reused. Once a field ID is assigned, it permanently identifies that logical column even if the column is later deprecated. This ensures partition specs using `source_id` references remain valid. | ||
|
|
||
| **Partition Field Validity**: If a source column is deprecated, existing partition fields referencing it via `source_id` remain valid for reading existing data. However, new partition spec versions should not reference deprecated columns. To remove a partition field, create a new partition spec version without that field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
existing partition fields referencing it via
source_idremain valid for reading existing data
In partitioned namespace we won't support time travel. After a filed is removed from schema, it seems we don't need the removed field when reading existing data.
- Engine analyzes query predicate and get partition expressions. The removed field doesn't exist in any predicate(otherwise it breaks semantic check since the removed field is not in schema), so it doesn't exist in any partition expression.
- Engine evaluates the expression, compute the expected partition value(s), queries
__manifestwith filters, retrieves the paths of matchingdatasettables, finally scan them.
So I think we don't need to mark source column as deprecated, instead maybe just remove it. We can also drop the partition_field column since there is noway to reference the partition field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I think you are right
Yes. I'm not saying we are focusing engines to use DataFusion. But by stating the expression in it, we do not need to explain the behaviors of the expressions anymore, all the behaviors to all data types, null, computation behaviors are already defined. Any engine can make sure the behaviors match exactly to the DataFusion behavior when evaluating the expression. With that being said, I think most likely we will add some rust implementation to return a list of datasets to scan for a query, and then bind that to other languages, so most engines will eventually call into DataFusion to do the pruning.
This goes to what I was talking about last paragraph. I think we will have a PartitionedNamespace which exposed a plan_scan step to return qualifying dataset and any residual filter. That API can be used by engines to do first level planning. For full DataFusion based execution, we can form the whole execution plan to scan all datasets and do additional reranking if necessary. |
I think there are 2 categories:
|
This pr tries to add partition as an experimental spec. Discussion could be fount at: #272
This PR remains a draft, and we still need to hold a vote to decide whether to introduce the partition spec.