spec: register table #22
Conversation
spec/catalog.yaml (Outdated)
      description: Optional write options for the table creation.
      nullable: true

  TableDefinition:
I guess I am a bit confused about this table definition part in general. Why is there a column definition list plus a schema? And for the schema, I was originally expecting it to resemble lance.file.Field, but it seems like it is not.
But before we get into those details, why not just pass a table location and an optional manifest location to create the table?
You can refer to this entity: https://github.com/lancedb/lancedb/blob/24e42ccd4d9fe9644038d3dd77bd3206e9ccbdc2/rust/lancedb/src/database.rs#L125.
But before we get into those details, why not just pass a table location and an optional manifest location to create the table?
AFAIK, currently, Lance cannot create a dataset without a schema?
Oh so basically I mean you directly create the first manifest on client side, then you create a table object in catalog based on that
I mean you directly create the first manifest on client side
You mean the client side is the environment of the HTTP client caller?
I mean the client just passes these params to the catalog server, and the server would call the lance/lancedb SDK to create a Lance dataset.
the client side is the environment of the HTTP client caller?
Yes
Seems like there are two topics to unpack here:
Q1: Do we need to pass the actual table metadata as JSON, or can we get away with just passing a path?
I was originally thinking about a very simple way to CreateTable, which is: the user creates the table + manifest on the client side (e.g. in Spark) using the Lance library, and then just calls the catalog to say "here is the table at some location, and optionally here is the latest manifest of this table".
For UpdateTable, the client side first produces the new manifest file for the Lance table. If the table is catalog managed (using the updated terminology in #11), then the client side calls UpdateTable to update the manifest location after the manifest is produced at the client side. If the table is storage managed, then the client side does not need to engage with the catalog at all, since the next GetTable call will automatically resolve the table to the newly committed version.
Technically all of this could work. If the whole purpose of the catalog is to let a compute engine know "here is a Lance table and it is at some location, go from there and use the Lance library to read or write it", I feel this direction is actually sufficient. It is also very simple, since we just don't deal with any table metadata model at the catalog layer, and there is no need to define any complicated JSON model.
Another benefit is that you don't need to deal with issues like the catalog service using a different version of the Lance library than the client side to run the commit, causing diverging behavior (which is actually an issue we have seen in IRC).
This is actually how HMS integrated with Iceberg for a very long time, without the IRC middle layer.
But I don't know if that would defeat your purpose of having a catalog service, since I don't know what your expected scope is. And if your scope is larger, do you think this would be a good middle step, or do you think we should jump directly into defining the full table metadata shape?
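To make that concrete, the register-style call could be as small as something like this (purely a sketch; RegisterTableRequest, manifest_location and the other names below are only illustrative, not decided):
RegisterTableRequest:          # hypothetical name, sketch only
  type: object
  required: [ name, location ]
  properties:
    name: { type: string }
    location:
      type: string
      description: Root location of the table created on the client side with the Lance library.
    manifest_location:
      type: string
      description: Optional pointer to the latest manifest committed by the client.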
Q2: Do we follow the LanceDB table metadata shape, or the Lance one?
If we do want to let the catalog service perform all the actual table commit operations on the service side, then we should discuss this question.
This seems like the main reason I was confused about the definition of TableDefinition, ColumnDefinition, etc.: these are the ones used in the LanceDB CreateTableRequest, but I was expecting the model to be the one that is fed to create a new Lance dataset manifest. From a code perspective, it is basically:
impl Manifest {
    pub fn new(
        schema: Schema, // this is the lance_core::datatypes::Schema
        fragments: Arc<Vec<Fragment>>,
        data_storage_format: DataStorageFormat,
        blob_dataset_version: Option<u64>,
    ) -> Self {
        // ...
    }
}
Within these inputs, we probably need just schema (fields + metadata) and data_storage_format.
I feel that makes more sense for decoupling Lance and LanceDB, as it makes the catalog closer to the Lance side. Here we would use the Lance schema definition that is aligned with the protobuf shape, it is not coupled with concepts like physical vs embedding, and we can always convert an Arrow schema to a Lance schema through fn try_from(schema: &ArrowSchema).
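As a rough illustration of that direction (the names below are placeholders I made up for the sketch, not the actual lance.file.Field message or the final spec), the catalog-side metadata could then be little more than what Manifest::new needs:
TableMetadata:                 # placeholder component name, not final
  type: object
  properties:
    schema:                    # fields + metadata, converted from the engine's Arrow schema
      type: object
      properties:
        fields:
          type: array
          items: { $ref: '#/components/schemas/LanceField' }  # placeholder for a Lance-aligned field object
        metadata:
          type: object
          additionalProperties: { type: string }
    data_storage_format:
      type: string
      description: Storage format/version recorded in the first manifest.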
Sure let's do that!
I was thinking about how Spark would integrate with this. When integrated with an engine, the flag exposed to users would be something like "create the table on the client side or server side". Consider the following:
@Override
public Table createTable(
    Identifier ident, StructType schema, Transform[] partitions, Map<String, String> properties)
    throws TableAlreadyExistsException, NoSuchNamespaceException {
  try {
    if (createTableAtClientSide) { // derived from Spark config
      LanceConfig config = LanceConfig.from(options, ident.name());
      WriteParams params = SparkOptions.genWriteParamsFromConfig(config);
      LanceDatasetAdapter.createDataset(ident.name(), schema, params);
      RegisterTableRequest req = new RegisterTableRequest();
      req.setName(...);
      req.setLocation(...);
      tableApi.registerTable(req);
    } else {
      CreateTableRequest req = new CreateTableRequest();
      req.setName(...);
      req.setLocation(...);
      req.setSchema(...);
      tableApi.createTable(req);
    }
  } catch (IllegalArgumentException e) {
    throw new TableAlreadyExistsException(ident.name(), new Some<>(e));
  } catch (ApiException e) {
    ...
  }
  return new LanceDataset(LanceConfig.from(options, ident.name()), schema);
}

It still feels more natural to me to just call the CreateTable API instead of calling two different APIs under the createTable connector logic:
@Override
public Table createTable(
    Identifier ident, StructType schema, Transform[] partitions, Map<String, String> properties)
    throws TableAlreadyExistsException, NoSuchNamespaceException {
  try {
    CreateTableRequest req = new CreateTableRequest();
    req.setName(...);
    req.setLocation(...);
    if (createTableAtClientSide) { // derived from Spark config
      LanceConfig config = LanceConfig.from(options, ident.name());
      WriteParams params = SparkOptions.genWriteParamsFromConfig(config);
      LanceDatasetAdapter.createDataset(ident.name(), schema, params);
      req.setRegisterOnly(true);
    } else {
      req.setSchema(...);
      req.setRegisterOnly(false);
    }
    tableApi.createTable(req);
  } catch (IllegalArgumentException e) {
    throw new TableAlreadyExistsException(ident.name(), new Some<>(e));
  } catch (ApiException e) {
    ...
  }
  return new LanceDataset(LanceConfig.from(options, ident.name()), schema);
}

It might be another argument for having just CreateTable, but maybe it's just personal preference at this point, and it could be argued both ways. We have stayed on this topic for quite a while, let's just pick an option and move forward 😃
How about createTable in the Spark integration only using the create table interface spec, and not considering register table?
let's just pick an option and move forward
Let's change this PR to register table?
Sure let's do that
spec/catalog.yaml (Outdated)
  mode:
    type: string
    enum: [ CREATE, EXIST_OK, OVERWRITE ]
  write_options:
What is the use case for this? What could be an example write option? And if we have write options, why not have other things like read options? Feels like we should just have a generic map of options/properties, which is what you return in the response anyway.
If we use the POSIX protocol, the write options could be ignored. If we use object storage, we need write options to package some config options, e.g. ak, sk, endpoint, region..., because we need to create a Lance dataset (write the first version).
That's actually an interesting perspective I never thought about. In IRC we did not have any configuration like that, and it's basically assumed that the service will figure out the credentials and other storage configs based on the table location the user passes in. Either the caller intrinsically has the permission (e.g. using Amazon SigV4 auth and then creating the table in S3), or the service has some other mechanism outside for the user to configure access permissions. But in general it feels to me that passing ak and sk directly in the request is not a good security practice since that is essentially displayed in plain text on the receiver end.
But in general it feels to me that passing ak and sk directly in the request is not a good security practice since that is essentially displayed in plain text on the receiver end.
Yes, in terms of security practice, it is true that static, clear-text AK/SK should not be passed.
But using this API, users could also pass a session token (assume role) or some other write configuration. I was just giving an example.
it's basically assumed that the service will figure out the credentials and other storage configs based on the table location the user passes in.
Yes, but those capabilities are provided by some business catalog adapter, right? We will integrate the Lance catalog with our business catalog, and it would provide these features.
But IMO we should have a way to let users pass general params to create a Lance dataset, right?
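For what it's worth, the generic map could stay as simple as something like this in the spec (the field name and description are just a sketch; see also the options-vs-properties naming discussion below):
write_options:                 # or `options` / `properties`, naming discussed below
  type: object
  nullable: true
  additionalProperties: { type: string }
  description: Optional string map passed through to the Lance dataset creation, e.g. storage endpoint or region.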
spec/catalog.yaml (Outdated)
  - name
  properties:
    name: { type: string }
    properties:
Just a nit, we might want to make the terminology consistent. In Database we used options, but here we are using properties. In general I think properties is a bit preferable, because there are SQL keywords like DBPROPERTIES and TBLPROPERTIES, and it gives the feeling that it is intrinsic to the object. What do you think?
About the options, do you mean write_options?
Also in CreateNamespaceRequest
The trade-off, I think, is whether we want to align with the naming convention of lance/lancedb, which uses options in its public APIs. I am open to either.
cc @jackye1995
jackye1995 left a comment
looks good to me!
@jackye1995 Does the codegen check CI have any legacy issues? I ran …