spec: register table #22
Conversation
spec/catalog.yaml (Outdated)
      description: Optional write options for the table creation.
      nullable: true

  TableDefinition:
I guess I am a bit confused about this table definition part in general. Why is there a column definition list plus a schema? And for the schema, I was originally expecting it to resemble lance.file.Field, but it seems like it is not.
But before we get into those details, why not just pass a table location and an optional manifest location to create the table?
You can refer to this entity: https://github.com/lancedb/lancedb/blob/24e42ccd4d9fe9644038d3dd77bd3206e9ccbdc2/rust/lancedb/src/database.rs#L125.
But before we get into those details, why not just pass a table location and an optional manifest location to create the table?
AFAIK, currently, Lance cannot create a dataset without a schema?
Oh so basically I mean you directly create the first manifest on client side, then you create a table object in catalog based on that
I mean you directly create the first manifest on client side
You mean the client side is the environment of the HTTP client caller?
I mean the client just passes these params to the catalog server, and the server would call the lance/lancedb SDK to create a Lance dataset.
the client side is the environment of the HTTP client caller?
Yes
Seems like there are two topics to unpack here:
Q1: Do we need to pass the actual table metadata as JSON, or can we get away with just passing a path?
I was originally thinking about a very simple way to CreateTable, which is: the user creates the table + manifest on the client side (e.g. in Spark) using the Lance library, and then just calls the catalog to say "here is the table at some location, and optionally here is the latest manifest of this table".
For UpdateTable, the client side first produces the new manifest file for the Lance table. If the table is catalog managed (using the updated terminology in #11), then the client side calls UpdateTable to update the manifest location after the manifest is produced at the client side. If the table is storage managed, then the client side does not need to engage with the catalog at all, since the next GetTable call will automatically resolve the table to the newly committed version.
Technically all of this could work. If the whole purpose of the catalog is to let a compute engine know "here is a Lance table and it is at some location, go from there and use the Lance library to read or write it", I feel this direction is actually sufficient. It is also very simple, since we just don't deal with any table metadata model at the catalog layer, and there is no need to define any complicated JSON model.
Another benefit is that you don't need to deal with issues like the catalog service using a different version of the Lance library than the client side to run the commit, causing diverging behavior (which is actually an issue we have seen in IRC).
This is actually how HMS integrated with Iceberg for a very long time, without the IRC middle layer.
But I don't know if that would defeat your purpose of having a catalog service, since I don't know what your expected scope is. And if your scope is larger, do you think this would be a good middle step, or do you think we should jump directly into defining the full table metadata shape?
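To make that concrete, the register-style call could be as small as something like this (purely a sketch; RegisterTableRequest, manifest_location and the other names below are only illustrative, not decided):
RegisterTableRequest:          # hypothetical name, sketch only
  type: object
  required: [ name, location ]
  properties:
    name: { type: string }
    location:
      type: string
      description: Root location of the table created on the client side with the Lance library.
    manifest_location:
      type: string
      description: Optional pointer to the latest manifest committed by the client.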
Q2: Do we follow the LanceDB table metadata shape, or the Lance one?
If we do want to let the catalog service perform all the actual table commit operations on the service side, then we should discuss this question.
This seems like the main reason I was confused about the definition of TableDefinition, ColumnDefinition, etc.: these are the ones used in the LanceDB CreateTableRequest, but I was expecting the model to be the one that is fed to create a new Lance dataset manifest. From a code perspective, it is basically:
impl Manifest {
    pub fn new(
        schema: Schema, // this is the lance_core::datatypes::Schema
        fragments: Arc<Vec<Fragment>>,
        data_storage_format: DataStorageFormat,
        blob_dataset_version: Option<u64>,
    ) -> Self {
        // ...
    }
}
Within these inputs, we probably need just schema (fields + metadata) and data_storage_format.
I feel that makes more sense for decoupling Lance and LanceDB, as it makes the catalog closer to the Lance side. Here we would use the Lance schema definition that is aligned with the protobuf shape, it is not coupled with concepts like physical vs embedding, and we can always convert an Arrow schema to a Lance schema through fn try_from(schema: &ArrowSchema).
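As a rough illustration of that direction (the names below are placeholders I made up for the sketch, not the actual lance.file.Field message or the final spec), the catalog-side metadata could then be little more than what Manifest::new needs:
TableMetadata:                 # placeholder component name, not final
  type: object
  properties:
    schema:                    # fields + metadata, converted from the engine's Arrow schema
      type: object
      properties:
        fields:
          type: array
          items: { $ref: '#/components/schemas/LanceField' }  # placeholder for a Lance-aligned field object
        metadata:
          type: object
          additionalProperties: { type: string }
    data_storage_format:
      type: string
      description: Storage format/version recorded in the first manifest.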
Sure let's do that!
I was thinking about how Spark would integrate with this. When integrated with an engine, the flag exposed to users would be something like "create the table on the client side or server side". Consider the following:
@Override
public Table createTable(
    Identifier ident, StructType schema, Transform[] partitions, Map<String, String> properties)
    throws TableAlreadyExistsException, NoSuchNamespaceException {
  try {
    if (createTableAtClientSide) { // derived from Spark config
      LanceConfig config = LanceConfig.from(options, ident.name());
      WriteParams params = SparkOptions.genWriteParamsFromConfig(config);
      LanceDatasetAdapter.createDataset(ident.name(), schema, params);
      RegisterTableRequest req = new RegisterTableRequest();
      req.setName(...);
      req.setLocation(...);
      tableApi.registerTable(req);
    } else {
      CreateTableRequest req = new CreateTableRequest();
      req.setName(...);
      req.setLocation(...);
      req.setSchema(...);
      tableApi.createTable(req);
    }
  } catch (IllegalArgumentException e) {
    throw new TableAlreadyExistsException(ident.name(), new Some<>(e));
  } catch (ApiException e) {
    ...
  }
  return new LanceDataset(LanceConfig.from(options, ident.name()), schema);
}

It still feels more natural to me to just call the CreateTable API instead of calling two different APIs under the createTable connector logic:
@Override
public Table createTable(
    Identifier ident, StructType schema, Transform[] partitions, Map<String, String> properties)
    throws TableAlreadyExistsException, NoSuchNamespaceException {
  try {
    CreateTableRequest req = new CreateTableRequest();
    req.setName(...);
    req.setLocation(...);
    if (createTableAtClientSide) { // derived from Spark config
      LanceConfig config = LanceConfig.from(options, ident.name());
      WriteParams params = SparkOptions.genWriteParamsFromConfig(config);
      LanceDatasetAdapter.createDataset(ident.name(), schema, params);
      req.setRegisterOnly(true);
    } else {
      req.setSchema(...);
      req.setRegisterOnly(false);
    }
    tableApi.createTable(req);
  } catch (IllegalArgumentException e) {
    throw new TableAlreadyExistsException(ident.name(), new Some<>(e));
  } catch (ApiException e) {
    ...
  }
  return new LanceDataset(LanceConfig.from(options, ident.name()), schema);
}

It might be another argument for having just CreateTable, but maybe it's just personal preference at this point, and it could be argued both ways. We have stayed on this topic for quite a while, let's just pick an option and move forward 😃
How about createTable in the Spark integration only using the create table interface spec, and not considering register table?
let's just pick an option and move forward
Let's change this PR to register table?
Sure let's do that
spec/catalog.yaml (Outdated)
  mode:
    type: string
    enum: [ CREATE, EXIST_OK, OVERWRITE ]
  write_options:
What is the use case for this? What could be an example write option? And if we have write options, why not have other things like read options? Feels like we should just have a generic map of options/properties, which is what you return in the response anyway.
If we use the POSIX protocol, the write options could be ignored. If we use object storage, we need write options to package some config options, e.g. ak, sk, endpoint, region..., because we need to create a Lance dataset (write the first version).
That's actually an interesting perspective I never thought about. In IRC we did not have any configuration like that, and it's basically assumed that the service will figure out the credentials and other storage configs based on the table location the user passes in. Either the caller intrinsically has the permission (e.g. using Amazon SigV4 auth and then creating the table in S3), or the service has some other mechanism outside for the user to configure access permissions. But in general it feels to me that passing ak and sk directly in the request is not a good security practice since that is essentially displayed in plain text on the receiver end.
But in general it feels to me that passing ak and sk directly in the request is not a good security practice since that is essentially displayed in plain text on the receiver end.
Yes, in terms of security practice, it is true that static, clear-text AK/SK should not be passed.
But using this API, users could also pass a session token (assume role) or some other write configuration. I was just giving an example.
it's basically assumed that the service will figure out the credentials and other storage configs based on the table location the user passes in.
Yes, but those capabilities are provided by some business catalog adapter, right? We will integrate the Lance catalog with our business catalog, and it would provide these features.
But IMO we should have a way to let users pass general params to create a Lance dataset, right?
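For what it's worth, the generic map could stay as simple as something like this in the spec (the field name and description are just a sketch; see also the options-vs-properties naming discussion below):
write_options:                 # or `options` / `properties`, naming discussed below
  type: object
  nullable: true
  additionalProperties: { type: string }
  description: Optional string map passed through to the Lance dataset creation, e.g. storage endpoint or region.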
spec/catalog.yaml (Outdated)
  - name
  properties:
    name: { type: string }
    properties:
Just a nit, we might want to make the terminology consistent. In Database we used options, but here we are using properties. In general I think properties is a bit preferable, because there are SQL keywords like DBPROPERTIES and TBLPROPERTIES, and it gives the feeling that it is intrinsic to the object. What do you think?
About the options, do you mean write_options?
Also in CreateNamespaceRequest
The trade-off, I think, is whether we want to align with the naming convention of lance/lancedb, which uses options in its public APIs. I am open to either.
cc @jackye1995
jackye1995 left a comment
looks good to me!
@jackye1995 Does the codegen check CI have any legacy issues? I ran …