feat: add partitioned namespace spec #279

wojiaodoubao · 2025-12-11T07:14:52Z

This PR tries to add partition as an experimental spec. Discussion can be found at: #272

This PR remains a draft, and we still need to hold a vote to decide whether to introduce the partition spec.

chatgpt-codex-connector · 2025-12-11T07:14:57Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

github-actions · 2025-12-11T07:15:12Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

docs/src/dir/partition-spec.md

docs/src/rest/partition-spec.md

wojiaodoubao · 2025-12-17T07:55:47Z

Hi @jackye1995, I've updated this PR. Could you help review it when you have time? Thanks very much~

docs/src/dir/partition-spec.md

jackye1995

Did a pass over the spec. I think we can merge this after one more round of iteration, then start the implementation in https://github.com/lance-format/lance/tree/main/rust/lance-namespace-impls/src/dir and iterate on that.

wojiaodoubao · 2026-01-02T16:40:15Z

Hi @jackye1995, thanks for your nice suggestions! I have updated this PR and addressed all the comments. There's one thing I'm not quite sure about.

we should expand this in an independent section to talk about partition transform and partition pruning. This part we should in general just reference how iceberg does it, with field ID, source ID, all those things.

My understanding is that the current spec is mostly consistent with Iceberg in terms of partition transforms, except for the JSON serialization. For example, Iceberg uses bucket[N] while Lance uses bucket with bucket_size=N. Additionally, the current solution is relatively simple and does not define partition field IDs. Is this OK?

jackye1995 · 2026-01-03T06:26:42Z

@wojiaodoubao I edited the doc with some additional designs, let me know if this looks good to you.

jackye1995 · 2026-01-04T02:53:55Z

Try to explain my thinking here:

* versioned partition spec: partition evolution is a requirement to be comparable against similar projects, so we do need to version the partition spec.
* source field ID: Lance table schema fields do have field IDs, recorded in field metadata, so we should use those instead of field names to be schema-evolution friendly.
* partition field ID: the partition field ID serves as the priority of partition keys, so we still know the order of the fields even if the spec is converted to a map or something similar.
* schema: the current approach is not fully correct yet, and I still want to think a bit more about it. The schema ID is used in the spec as the source ID, so we need to know all fields that are still referenced in the partition specs, even if they have already been dropped. Currently I just say we keep the unioned schema, but that might not be enough because the field ID could differ between the namespace schema and the underlying table schema. Maybe the easiest way is to just say in the spec that the namespace-level and table-level schemas should remain consistent.
* partition expression: I switched to using just a DataFusion expression to describe the partition transform. I feel this is more generic and also more consistent with the overall Lance philosophy that we try to use Arrow & DataFusion semantics rather than invent our own semantics for query-related things.
* partition namespace name: if we use col=val style and a value contains the $ character, it would not work. So we just use a random name, and the partition key and value can be found in namespace properties (generated dynamically from __manifest column data). This also allows us to avoid leaking data information in the namespace name for users that want high security measures.
* partition pruning: to make pruning work more efficiently, we directly store partition info as columns, and pruning is just a SQL call to select the tables matching the predicate.
* transaction support: I added the additional read_version column in __manifest so that if users want multi-partition transactions, __manifest can be used as the main medium to coordinate commits and enforce read isolation. This section probably needs more detail on the read and write paths.
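The namespace naming point in the list above can be illustrated with a small sketch. It assumes `$` is the delimiter between levels of an object identifier (as the comment implies); `split_object_id` and `new_partition_namespace` are hypothetical helpers for illustration, not part of the spec or of any Lance API.

```python
import uuid

DELIMITER = "$"  # assumed level delimiter in object identifiers


def split_object_id(object_id: str) -> list[str]:
    """Split an identifier such as 'sales$<partition>$dataset' into levels."""
    return object_id.split(DELIMITER)


# col=val naming: a value containing "$" is misread as an extra level.
leaky_name = "price=US$100"
broken_levels = split_object_id(DELIMITER.join(["sales", leaky_name, "dataset"]))


def new_partition_namespace(partition_values: dict) -> tuple[str, dict]:
    """Hypothetical helper: random name, partition values kept as properties."""
    name = uuid.uuid4().hex  # hex digits only, so it can never contain "$"
    return name, dict(partition_values)


name, properties = new_partition_namespace({"price": "US$100"})
safe_levels = split_object_id(DELIMITER.join(["sales", name, "dataset"]))
```

With the random name, the identifier always splits into the intended three levels, and the value `US$100` is still recoverable from the properties without appearing in the name.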

docs/src/dir/partition-spec.md

docs/src/dir/catalog-spec.md

wojiaodoubao · 2026-01-04T02:17:10Z

docs/src/dir/partition-spec.md

+The partitioning information is stored in `partition_spec_v<N>` (e.g., `partition_spec_v1`), which is a JSON array of partition field objects. Each partition field contains:
+
+* A **field id** uniquely identifying this partition field
+* A **name** for the partition field (used as the column name in `__manifest`)


I'm not sure if I understand correctly.

I think the name needs to be unique because it is a column in __manifest. Since the spec supports partition evolution, we need to prevent users from adding a partition field with an existing name when adding a partition column.

partition field ID: partition field ID serves as the priority of partition keys, so we can still know the order of the fields even if it is converted to a map or something.

The field ID represents the sequence. So if I have partition fields country, city, and I want to update them to continent, country, city, what should I do?

First I need to resolve the field ID issue: the field ID of country must be less than continent, so is city. I can do it by adding 3 new partition fields.

Second, I'll face name conflicts. Can I reuse country and city as partition field names?

Shall we use {partition_field_id}_{partition_field_name} as the column name in the __manifest table?

The Field ID represents the sequence. So if I have partition fields country, city, and I want to update it to continent , country, city, what should I do?

My thinking is that the sequence only matters because we need to know how to organize the namespace levels (as described in #279 (comment))

So if you add a continent, it will be country -> city -> continent. Technically, for use cases like query pruning and partitioned joins, I don't think this matters because you can still find "all partitions belonging to continent Asia" without a problem even though it is the last level, not the first.

First I need to resolve the field id issue: the filed id of country must be less than continent, so is city. I can do it by adding 3 new partition fields.

The partition spec versions are not independent; there should be some relationship between them. In the example you give, we are adding a new partition field, so the country and city field IDs should remain unchanged, and continent gets the next field ID.

Second I'll face name conflicts. Can I reuse country and city as partition field name?

That's a good point. I think the safest approach is to reference the partition fields only by ID, so the column names in the __manifest table would take the form partition_field_{i}.
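The evolution rule discussed in this thread (existing field IDs stay unchanged, a new field takes the next ID, and `__manifest` columns are named by ID only) could be sketched as follows. The dictionary keys `field_id`, `source_id`, and `expression`, the source ID values, and the helper names are assumptions made for illustration; the spec's JSON layout may differ.

```python
def add_partition_field(versions: list[list[dict]], source_id: int, expression: str) -> list[dict]:
    """Return a new spec version that appends one field to the latest spec."""
    latest = versions[-1]
    # Look at every version so the ID of a previously removed field is not reused.
    highest = max((f["field_id"] for spec in versions for f in spec), default=0)
    new_field = {"field_id": highest + 1, "source_id": source_id, "expression": expression}
    return latest + [new_field]


def manifest_column(field: dict) -> str:
    """Column name in __manifest, derived from the partition field ID only."""
    return f"partition_field_{field['field_id']}"


spec_v1 = [
    {"field_id": 1, "source_id": 10, "expression": "country"},
    {"field_id": 2, "source_id": 11, "expression": "city"},
]
spec_v2 = add_partition_field([spec_v1], source_id=12, expression="continent")
columns = [manifest_column(f) for f in spec_v2]
```

Here continent becomes the third level (country -> city -> continent), and because columns are named by ID, a later field could reuse the display name `country` without colliding with an existing column.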

wojiaodoubao · 2026-01-04T02:41:55Z

docs/src/dir/partition-spec.md

+
+1. Query engine analyzes the query predicate to identify filters on partition columns
+2. For each partition expression, the engine evaluates the expression with the query values to compute the expected partition value(s)
+3. Engine queries `__manifest` with filters on the partition columns


Will DirectoryNamespace provide a method to query the __manifest table? Or a method to push down a filter and get table identifiers?

I think we already do, just need to make it pub(crate): https://github.com/lance-format/lance/blob/main/rust/lance-namespace-impls/src/dir/manifest.rs#L535
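The three pruning steps quoted above can be sketched with an in-memory SQLite table standing in for `__manifest`. This is only an illustration: the real `__manifest` is a Lance table, the column set shown here is assumed, and the `prune` helper is hypothetical. The year extraction uses plain Python in place of the DataFusion partition expression.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
# Stand-in for __manifest with two partition columns (assumed layout).
conn.execute(
    "CREATE TABLE __manifest (object_id TEXT, location TEXT, "
    "partition_field_1 TEXT, partition_field_2 INTEGER)"
)
conn.executemany(
    "INSERT INTO __manifest VALUES (?, ?, ?, ?)",
    [
        ("sales$p1", "part_a", "US", 2025),
        ("sales$p2", "part_b", "US", 2024),
        ("sales$p3", "part_c", "FR", 2025),
    ],
)


def prune(conn: sqlite3.Connection, filters: dict) -> list[str]:
    """Return locations of tables whose partition columns match all filters."""
    # Column names come from the partition spec, values are bound as parameters.
    where = " AND ".join(f"{col} = ?" for col in filters) or "1=1"
    sql = f"SELECT location FROM __manifest WHERE {where} ORDER BY location"
    return [row[0] for row in conn.execute(sql, list(filters.values()))]


# Step 1: the query filters on country = 'US' and event_date = 2025-03-14.
# Step 2: evaluate the (assumed) year partition expression on the query value.
year = date(2025, 3, 14).year
# Step 3: query the manifest with filters on the partition columns.
matching = prune(conn, {"partition_field_1": "US", "partition_field_2": year})
```

Only the matching tables are then opened and scanned; filters on non-partition columns would still have to be applied during the scan.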

docs/src/dir/partition-spec.md

wojiaodoubao · 2026-01-04T04:59:50Z

versioned partition spec: partition evolution is a requirement to be comparable against similar projects, so we do need to version the partition spec

+1

source field ID: lance table schema fields do have field ID, it's recorded in field metadata, so we should use that instead of field name to be schema evolution-friendly

Does it mean all tables must have consistent field IDs?

partition field ID: partition field ID serves as the priority of partition keys, so we can still know the order of the fields even if it is converted to a map or something.

+1

schema: the current approach is not fully correct yet, I still want to think a bit more on it. the schema id is used in spec as source id, so we need to know all fields that are still referenced in the partition specs, even if it is already dropped. Currently I just say we keep the unioned schema, but that might be not enough because the field ID could differ in namespace schema and underlying table schema. Maybe the easiest way is to just say in the spec that the namespace and table level schema should remain consistent.

+1 say in the spec that the namespace and table level schema should remain consistent.

partition expression: I switched to use just DataFusion expression to describe the partition transform, I feel this is more generic and also makes it more consistent with over Lance philosophy that we try to use Arrow & DataFusion semantics, rather than invent our own semantics for query related stuffs.

+1. For engines that support DataFusion, we can directly use DataFusion to compute partition values; for those that do not, we need to implement functions with the same semantics ourselves, right?

partition namespace name: if we use col=val style, if value has $ character it would not work. So we just use a random name and the partition key and value can be found in namespace properties (generated dynamically from __manifest column data). This also allow us to avoid leaking data information in the namespace name for users that want high security measures.

+1

partition pruning: to make pruning work more efficiently, we directly store partition info as columns, and pruning is just a SQL call to select the table matching the predicate.

+1. I was thinking, how should we expose it to engines? It seems like a bad idea to expose the __manifest table directly. So maybe add a new function to DirectoryNamespace (not Namespace) to perform partition pruning?

transaction support: I added the additional read_version column in __manifest so that if user desire multi-partition transaction, __manifest can be used as the main medium to coordinate commits and enforce read isolation. More details are probably needed in this section to describe the read and write path in more details.

Current spec supports schema evolution, partition evolution, and ACID. Does this mean it adds an additional table format layer on top of the Lance Table Format? I wonder if we need to clarify which table format capabilities can be implemented at the partition layer and which will never be implemented there.

jackye1995 · 2026-01-04T07:37:37Z

Does this mean it adds an additional table format layer on top of the Lance Table Format? I wonder if we need to clarify which table format capabilities can be implemented at the partition layer and which will never be implemented there.

What specific capabilities are you thinking about? In my mind the cut is clear: the Lance table format will never offer partitioning and will only do clustering. This has been clear in the previous conversations. Any feature related to hard-partitioning a logical dataset into physical parts goes in this layer.

Note that we are only talking about the table data. We could still do partitioning for a specific index, if that makes sense for that index type.

wojiaodoubao · 2026-01-04T07:47:40Z

What specific capabilities are you thinking about?

I was thinking of capabilities like branch, tag, time travel, deletion of a physical partition (table), compaction, cleanup, etc. for the partitioned namespace.

I like the current capability set: schema evolution, partition evolution, and read_version. I think it might be too much if we add the capabilities above to the partitioned namespace.

Any feature related to hard-partitioning a logical dataset into physical parts go in this layer.

+1, agree.

wojiaodoubao · 2026-01-04T12:33:24Z

docs/src/dir/partition-spec.md

+
+**Field ID Stability**: Field IDs (`lance:field_id`) are never reused. Once a field ID is assigned, it permanently identifies that logical column even if the column is later deprecated. This ensures partition specs using `source_id` references remain valid.
+
+**Partition Field Validity**: If a source column is deprecated, existing partition fields referencing it via `source_id` remain valid for reading existing data. However, new partition spec versions should not reference deprecated columns. To remove a partition field, create a new partition spec version without that field.
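The "never reused" rule in the excerpt above amounts to allocating IDs from a counter that only grows. A minimal sketch, assuming a hypothetical `FieldIdAllocator` class (the starting value and the in-memory bookkeeping are illustrative choices, not how Lance stores field IDs):

```python
class FieldIdAllocator:
    """Hands out field IDs that are never reused, even after a column is dropped."""

    def __init__(self) -> None:
        self._last_id = 0  # highest ID ever assigned; never decreases
        self._live: dict[str, int] = {}  # current column name -> field ID

    def add(self, name: str) -> int:
        self._last_id += 1
        self._live[name] = self._last_id
        return self._last_id

    def drop(self, name: str) -> int:
        # Dropping frees the name but not the ID.
        return self._live.pop(name)


ids = FieldIdAllocator()
country_id = ids.add("country")
city_id = ids.add("city")
ids.drop("city")
new_city_id = ids.add("city")  # same name, new logical column, new ID
```

Because a re-added column gets a fresh ID, a `source_id` recorded in an older partition spec can never silently point at a different column.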


existing partition fields referencing it via source_id remain valid for reading existing data

In the partitioned namespace we won't support time travel. After a field is removed from the schema, it seems we don't need the removed field when reading existing data.

The engine analyzes the query predicate and gets the partition expressions. The removed field doesn't exist in any predicate (otherwise it breaks the semantic check, since the removed field is not in the schema), so it doesn't exist in any partition expression.

The engine evaluates the expression, computes the expected partition value(s), queries __manifest with filters, retrieves the paths of the matching dataset tables, and finally scans them.

So I think we don't need to mark the source column as deprecated; instead, maybe just remove it. We can also drop the partition_field column since there is no way to reference the partition field.

yes I think you are right

jackye1995 · 2026-01-04T18:09:30Z

For engines that support DataFusion, we can directly use DataFusion to compute partition values; for those that do not, we need to implement functions with the same semantics ourselves, right?

Yes. I'm not saying we are forcing engines to use DataFusion. But by stating the expression in it, we do not need to explain the behavior of the expressions anymore: the behavior for all data types, nulls, and computations is already defined. Any engine can make sure its behavior matches DataFusion exactly when evaluating the expression.

With that being said, I think most likely we will add a Rust implementation that returns the list of datasets to scan for a query, and then bind that to other languages, so most engines will eventually call into DataFusion to do the pruning.

how should we expose it to engine? It seems a bad idea to expose __manifest table directly. So maybe adding a new function to DirectoryNamespace(not Namespace) to perform partition pruning?

This goes back to what I was talking about in the last paragraph. I think we will have a PartitionedNamespace which exposes a plan_scan step that returns the qualifying datasets and any residual filter. That API can be used by engines to do first-level planning. For full DataFusion-based execution, we can form the whole execution plan to scan all datasets and do additional reranking if necessary.
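The shape of such a plan_scan step might look like the sketch below. Everything here is hypothetical: the `PartitionedNamespace` class, the `ScanPlan` type, and the representation of a predicate as a dictionary of equality conditions are assumptions for illustration, since the comment only names the API and what it returns.

```python
from dataclasses import dataclass


@dataclass
class ScanPlan:
    datasets: list[str]    # locations of datasets that may contain matching rows
    residual_filter: dict  # conditions the engine must still apply while scanning


class PartitionedNamespace:
    """Hypothetical stand-in; partitions maps dataset location -> partition values."""

    def __init__(self, partitions: dict[str, dict]) -> None:
        self.partitions = partitions

    def plan_scan(self, predicate: dict) -> ScanPlan:
        partition_cols = {c for values in self.partitions.values() for c in values}
        # Conditions on partition columns prune datasets; the rest are residual.
        prune_on = {c: v for c, v in predicate.items() if c in partition_cols}
        residual = {c: v for c, v in predicate.items() if c not in partition_cols}
        datasets = sorted(
            loc
            for loc, values in self.partitions.items()
            if all(values.get(c) == v for c, v in prune_on.items())
        )
        return ScanPlan(datasets, residual)


ns = PartitionedNamespace(
    {
        "part_a": {"country": "US", "year": 2025},
        "part_b": {"country": "US", "year": 2024},
        "part_c": {"country": "FR", "year": 2025},
    }
)
plan = ns.plan_scan({"country": "US", "amount": 5})
```

An engine would scan only `plan.datasets` and apply `plan.residual_filter` to the rows it reads, which is the first-level planning described in the comment.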

jackye1995 · 2026-01-04T19:23:30Z

I was thinking capabilities like: branch, tag, time travel, deletion a physical partition(table), compaction, cleanup etc for partitioned namespace

I think there are 2 categories:

* features in the table format: by the nature of using a __manifest Lance table, implementations of the partitioned namespace spec are able to do branching, time travel, deletion vectors, etc. against the contents of __manifest. I don't think we need to strictly enumerate all those features or ban any of them, since the spec mainly describes the storage layout, not every feature it can or cannot support.
* features offered in the Rust SDK and engine implementations: for example compaction, cleanup, etc. I think we will just start from an initial set of features like basic read and write support, and add whatever new features we need step by step. We might want to do some planning together. Although I put things like partition evolution in the spec, that's mainly to ensure the spec is future-proof and we don't need to immediately do a v2 in the short term. We don't have to implement it right away if features like compaction across partitions are more important.

doc: add partitioin spec (experimental)

5db69f8

wojiaodoubao marked this pull request as draft December 11, 2025 07:15

jackye1995 reviewed Dec 12, 2025

docs/src/dir/partition-spec.md (outdated)

jackye1995 reviewed Dec 12, 2025

docs/src/rest/partition-spec.md (outdated)

wojiaodoubao changed the title ~~doc: add partitioin spec (experimental)~~ doc: add partitioin spec - experimental Dec 12, 2025

wojiaodoubao changed the title ~~doc: add partitioin spec - experimental~~ doc: add partitioin spec Dec 12, 2025

fix

d6fb64f

wojiaodoubao force-pushed the partition-spec branch from 9262dcb to 5b893f0 Compare December 13, 2025 03:21

wojiaodoubao marked this pull request as ready for review December 13, 2025 03:22

wojiaodoubao force-pushed the partition-spec branch from 5b893f0 to ce8bb5c Compare December 13, 2025 03:25

wojiaodoubao changed the title ~~doc: add partitioin spec~~ docs: add partitioin spec Dec 13, 2025

github-actions bot added the documentation Improvements or additions to documentation label Dec 13, 2025

refactor as a special directory namespace

bd6fb89

wojiaodoubao force-pushed the partition-spec branch from ce8bb5c to bd6fb89 Compare December 13, 2025 03:27