
Conversation


@yanghua commented Apr 15, 2025

No description provided.

@github-actions bot added the spec (Restful openapi spec) label Apr 15, 2025
@yanghua changed the title from "codegen: create table" to "spec: create table" Apr 15, 2025
@github-actions bot added the enhancement (New feature or request) label Apr 15, 2025
@yanghua marked this pull request as ready for review April 15, 2025 05:16
@yanghua requested a review from @jackye1995 April 15, 2025 05:16
description: Optional write options for the table creation.
nullable: true

TableDefinition:
Collaborator:

I guess I am a bit confused about this table definition part in general. Why is there a column definition list plus a schema? And for the schema, I was originally expecting it to resemble lance.file.Field, but it seems like it does not.

But before we get into those details, why not just pass a table location and an optional manifest location to create the table?

Collaborator Author (@yanghua):

You can refer to this entity: https://github.com/lancedb/lancedb/blob/24e42ccd4d9fe9644038d3dd77bd3206e9ccbdc2/rust/lancedb/src/database.rs#L125.

But before we get into those details, why not just pass a table location and an optional manifest location to create the table?

AFAIK, Lance currently cannot create a dataset without a schema, right?
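
For reference, a rough sketch of how dataset creation looks with the Lance Rust API as I understand it (treat the crate paths and signatures here as my assumption, not the authoritative API): even an "empty" dataset is written from a RecordBatchReader, and that reader's Arrow schema is what defines the dataset schema.

use std::sync::Arc;

use arrow_array::{RecordBatch, RecordBatchIterator};
use arrow_schema::{ArrowError, DataType, Field, Schema as ArrowSchema};
use lance::dataset::{Dataset, WriteParams};

// Sketch only: write a dataset with zero batches; the schema still has to be
// supplied up front because it travels with the RecordBatchReader.
async fn create_empty_dataset(uri: &str) -> Result<Dataset, Box<dyn std::error::Error>> {
    let schema = Arc::new(ArrowSchema::new(vec![
        Field::new("id", DataType::Int64, false),
    ]));
    let empty_reader = RecordBatchIterator::new(
        std::iter::empty::<Result<RecordBatch, ArrowError>>(),
        schema,
    );
    let dataset = Dataset::write(empty_reader, uri, Some(WriteParams::default())).await?;
    Ok(dataset)
}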

Collaborator:

Oh, so basically I mean you directly create the first manifest on the client side, then create a table object in the catalog based on that.

Collaborator Author (@yanghua):

I mean you directly create the first manifest on the client side

You mean the client side is the environment of the HTTP client caller?

I mean, the client just passes these params to the catalog server, and it would call the lance/lancedb SDK to create a Lance dataset.

@jackye1995 (Collaborator), Apr 15, 2025:

the client side is the environment of the HTTP client caller?

Yes

Seems like there are 2 topics to unfold here:

Q1: Do we need to pass in actual table metadata in the shape of JSON, or can we get away with just passing a path?

I was originally thinking about a very simple way to CreateTable, where the user creates the table + manifest on the client side (e.g. in Spark) using the Lance library, and then just calls the catalog to say "here is the table at some location, and optionally here is the latest manifest of this table".

For UpdateTable, the client side first produces the new manifest file for the Lance table. If the table is catalog managed (using the updated terminology in #11), then the client side calls UpdateTable to update the manifest location after the manifest is produced on the client side. If the table is storage managed, then the client side does not need to engage with the catalog at all, since the next GetTable call will automatically resolve the table to the newly committed version.

Technically all of this could work. If the whole purpose of the catalog is to let a compute engine know "here is a Lance table and it is at some location, go from there and use the Lance library to read or write it", I feel this direction is actually sufficient. It is also very simple, since we just don't deal with any table metadata model at the catalog layer and there is no need to define any complicated JSON model.
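
To make that concrete, the register-only request could be as small as the sketch below (struct and field names are hypothetical, purely to show how little the catalog would need to model in this direction):

// Hypothetical shape for the Q1 direction: the client has already written the
// table (and its first manifest) with the Lance library; the catalog only
// records where it lives.
struct RegisterTableRequest {
    name: String,
    // Root location of the Lance dataset written on the client side.
    location: String,
    // Optional: latest manifest location, for catalog-managed tables.
    manifest_location: Option<String>,
}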

Another benefit is that you don't need to deal with issues like the catalog service using a different version of the Lance library than the client side to run the commit, causing diverging behavior (which is actually an issue we have seen in IRC).

This is actually how HMS has integrated with Iceberg for a very long time without the IRC middle layer.

But I don't know if that would defeat your purpose of having a catalog service, since I don't know what your expected scope is. And if your scope is larger, do you think this would be a good middle step, or do you think we should jump directly into defining the full table metadata shape?

Q2: Do we follow the LanceDB table metadata shape, or the Lance one?

If we do want to let the catalog service perform all the actual table commit operations on the service side, then we should discuss this question.

This seems like the main reason I was confused about the definition of TableDefinition, ColumnDefinition, etc.: these are the ones used in the LanceDB CreateTableRequest, but I was expecting the model to be the one that is fed to create a new Lance dataset manifest. From a code perspective, it is basically:

impl Manifest {
    pub fn new(
        schema: Schema, // this is the lance_core::datatypes::Schema
        fragments: Arc<Vec<Fragment>>,
        data_storage_format: DataStorageFormat,
        blob_dataset_version: Option<u64>,
    ) -> Self {
        // ...
    }
}

and within these inputs we probably need just the schema (fields + metadata) and data_storage_format.

I feel that makes more sense for decoupling Lance and LanceDB, as it makes the catalog closer to the Lance side. Here we would use the Lance schema definition that is aligned with the protobuf shape, which is not coupled with concepts like physical vs. embedding columns, and we can always convert an Arrow schema to a Lance schema through fn try_from(schema: &ArrowSchema).
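
A small sketch of that conversion path, assuming the try_from mentioned above is the TryFrom<&ArrowSchema> impl on lance_core::datatypes::Schema (the crate paths here are my assumption):

use arrow_schema::{DataType, Field as ArrowField, Schema as ArrowSchema};
use lance_core::datatypes::Schema as LanceSchema;

// The engine-facing Arrow schema becomes the Lance schema model the manifest
// is built from, so the catalog never touches LanceDB-level concepts like
// physical vs. embedding columns.
fn arrow_to_lance(arrow_schema: &ArrowSchema) -> Result<LanceSchema, Box<dyn std::error::Error>> {
    Ok(LanceSchema::try_from(arrow_schema)?)
}

fn example() -> Result<(), Box<dyn std::error::Error>> {
    let arrow_schema = ArrowSchema::new(vec![
        ArrowField::new("id", DataType::Int64, false),
        ArrowField::new("text", DataType::Utf8, true),
    ]);
    let _lance_schema = arrow_to_lance(&arrow_schema)?;
    Ok(())
}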

Collaborator:

Sure let's do that!

@jackye1995 (Collaborator), Apr 18, 2025:

I was thinking about how Spark would integrate with this. When integrated with an engine, the flag exposed to users would be something like "create the table on the client side or the server side"; consider the following:

@Override
  public Table createTable(
      Identifier ident, StructType schema, Transform[] partitions, Map<String, String> properties)
      throws TableAlreadyExistsException, NoSuchNamespaceException {
    try {
      if (createTableAtClientSide) { // derived from Spark config
          LanceConfig config = LanceConfig.from(options, ident.name());
          WriteParams params = SparkOptions.genWriteParamsFromConfig(config);
          LanceDatasetAdapter.createDataset(ident.name(), schema, params);
    
          RegisterTableRequest req = new RegisterTableRequest();
          req.setName(...);
          req.setLocation(...);
          tableApi.registerTable(req);
      } else {
          CreateTableRequest req = new CreateTableRequest();
          req.setName(...);
          req.setLocation(...);
          req.setSchema(...);
          tableApi.createTable(req);
      }

    } catch (IllegalArgumentException e) {
      throw new TableAlreadyExistsException(ident.name(), new Some<>(e));
    } catch (ApiException e) {
      ...
    }
    return new LanceDataset(LanceConfig.from(options, ident.name()), schema);
  }

It still feels more natural to me to just call the CreateTable API instead of calling 2 different APIs under the createTable connector logic:

@Override
  public Table createTable(
      Identifier ident, StructType schema, Transform[] partitions, Map<String, String> properties)
      throws TableAlreadyExistsException, NoSuchNamespaceException {
    try {

      CreateTableRequest req = new CreateTableRequest();
      req.setName(...);
      req.setLocation(...);
      if (createTableAtClientSide) { // derived from Spark config
          LanceConfig config = LanceConfig.from(options, ident.name());
          WriteParams params = SparkOptions.genWriteParamsFromConfig(config);
          LanceDatasetAdapter.createDataset(ident.name(), schema, params);
          req.setRegisterOnly(true);
      } else {
          req.setSchema(...);
          req.setRegisterOnly(false);
      }
      tableApi.createTable(req);

    } catch (IllegalArgumentException e) {
      throw new TableAlreadyExistsException(ident.name(), new Some<>(e));
    } catch (ApiException e) {
      ...
    }
    return new LanceDataset(LanceConfig.from(options, ident.name()), schema);
  }

It might be another argument for having just CreateTable, but maybe it's just personal preference at this point, and it could be argued both ways. We have stayed on this topic for quite a while; let's just pick an option and move forward 😃

Collaborator Author (@yanghua):

How about having createTable in the Spark integration only use the create table interface spec, and not consider register table?

Collaborator Author (@yanghua):

let's just pick an option and move forward

Shall we change this PR to register table, then?

Collaborator:

Sure let's do that

mode:
  type: string
  enum: [ CREATE, EXIST_OK, OVERWRITE ]
write_options:
Collaborator:

What is the use case for this? What could be an example write option? And if we have write options, why not have other things like read options? It feels like we should just have a generic map of options/properties, which is what you return in the response anyway.

Collaborator Author (@yanghua):

If we use a POSIX filesystem, the write options could be ignored. If we use object storage, we need write options to package some config options, e.g. AK, SK, endpoint, region..., because we need to create a Lance dataset (write the first version).
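
For illustration, the kind of key/value pairs I have in mind for write_options looks roughly like the sketch below (the keys are examples only, not a fixed contract):

use std::collections::HashMap;

// Hypothetical storage-related options a caller could attach so the catalog
// server can write the first dataset version to object storage.
fn example_write_options() -> HashMap<String, String> {
    HashMap::from([
        ("endpoint".to_string(), "https://s3.<region>.amazonaws.com".to_string()),
        ("region".to_string(), "<region>".to_string()),
        // Static AK/SK are only an example; a short-lived session token from
        // assume-role would also fit here.
        ("session_token".to_string(), "<temporary-credential>".to_string()),
    ])
}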

Collaborator:

That's actually an interesting perspective I never thought about. In IRC we did not have any configuration like that; it is basically assumed that the service will figure out the credentials and other storage configs based on the table location the user passes in. Either the caller intrinsically has the permission (e.g. using Amazon SigV4 auth and then creating the table in S3), or the service has some other mechanism outside the request for the user to configure access permissions. But in general it feels to me that passing AK and SK directly in the request is not a good security practice, since that is essentially displayed in plain text on the receiver end.

Collaborator Author (@yanghua):

But in general it feels to me that passing AK and SK directly in the request is not a good security practice, since that is essentially displayed in plain text on the receiver end.

Yes, in terms of security practice, it is true that static, clear-text AK/SK should not be passed.

But using this API, users can also pass a session token (assume role) or some other write configuration. The AK/SK were just an example.

it is basically assumed that the service will figure out the credentials and other storage configs based on the table location the user passes in.

Yes, but those capabilities are provided by some business catalog adapter, right? We will integrate the Lance catalog with our business catalog, and it would provide these features.

But, IMO, we should have a way to let users pass general params to create a Lance dataset, right?

- name
properties:
  name: { type: string }
  properties:
@jackye1995 (Collaborator), Apr 15, 2025:

Just a nit: we might want to make the terminology consistent. In Database we used options, but here we are using properties. In general I think properties is a bit preferable, because there are SQL keywords like DBPROPERTIES and TBLPROPERTIES, and it gives the feeling that it is intrinsic to the object. What do you think?

Collaborator Author (@yanghua):

About the options, do you mean write_options?

Collaborator:

Also in CreateNamespaceRequest

Collaborator Author (@yanghua):

The trade-off, I think, is whether we want to align with the naming convention of lance/lancedb, which uses options in its public APIs. I am open to either.

@yanghua changed the title from "spec: create table" to "spec: register table" Apr 21, 2025
@github-actions bot added the java (Java features) and rust (Rust features) labels Apr 21, 2025
@yanghua force-pushed the i-5 branch 4 times, most recently from 3595d47 to 3a90a76 on April 21, 2025 13:42

yanghua commented Apr 21, 2025

cc @jackye1995

@jackye1995 (Collaborator) left a comment:

looks good to me!

@github-actions bot added the python (Python features) label Apr 22, 2025

yanghua commented Apr 22, 2025

@jackye1995 Does the codegen check CI have any legacy issues? I ran make gen under the python dir, but the diff check failed:

-test/test_get_table_response.py
-test/test_register_table_request.py
-test/test_table_api.py

@yanghua merged commit 59754f7 into lance-format:main Apr 22, 2025
8 checks passed
@yanghua deleted the i-5 branch April 22, 2025 00:36
@yanghua mentioned this pull request Apr 22, 2025
