spec: drop table under ns #53
Conversation
tags: [ Table ]
summary: Drop a table from the catalog
operationId: DropTable
parameters:
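For context, a minimal sketch of how the full operation could look in the spec (the path shape, parameter names, and response codes are assumptions for illustration, not the exact text under review):

```yaml
# Illustrative sketch only: path shape, parameter names, and response
# codes are assumptions, not the exact spec text in this PR.
/v1/namespaces/{ns}/tables/{table}:
  delete:
    tags: [ Table ]
    summary: Drop a table from the catalog
    operationId: DropTable
    parameters:
      - name: ns
        in: path
        required: true
        schema:
          type: string
      - name: table
        in: path
        required: true
        schema:
          type: string
    responses:
      "204":
        description: Table dropped successfully
      "404":
        description: Table does not exist
```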
I was browsing through the history of the Iceberg REST Catalog (IRC) about this, because I remember there was quite some controversy around this feature. In summary, I think there are two big reasons:

(1) There is basically no way for the server to actually purge the data and signal back to the client that the purge succeeded. A purge can take a long time, but this API has to return right away. Even if a background job is purging the data, that process can fail halfway (which happens a lot, due to issues like permissions), and there is no way to report back to the client that the purge failed and that some action needs to be taken.

There is a way to solve this, which is to introduce a transaction API for progress tracking, so the client can detect any issue with purging through that API. I think this would also be helpful for DropNamespace, to allow cascading drops. If you agree with this direction, I can open a PR to add that API.

(2) In Iceberg, there is a table property called gc.enabled that indicates the same thing. Users expect that property to dictate whether the table should be purged during a drop, so it can become inconsistent with the API request (e.g. the table property says the data must be purged, but the user does not request a purge). I don't know if there is a similar situation in Lance that we need to consider here.
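To make that inconsistency concrete (gc.enabled is Iceberg's actual property name; the request shape below is illustrative, not taken from either spec):

```yaml
# Iceberg table metadata says data must be purged when the table is dropped:
properties:
  gc.enabled: "true"
# ...but the client's drop request does not ask for a purge
# (hypothetical request shape):
# DELETE /v1/namespaces/{ns}/tables/{table}?purgeRequested=false
```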
> I don't know if there is a similar situation in Lance that we need to consider here.
AFAIK, Lance only follows the physical delete semantics when dropping a dataset.
> There is a way to solve this, which is to introduce a transaction API for progress tracking, so the client can detect any issue with purging through that API.
You mean an async API?
I mean, there may be two requirements to make the async approach work:
- in the catalog server (adapter service), we need to support async delete logic (e.g. drop namespace, as you said)
- for deleting a table, the Lance SDK must support async delete logic

WDYT?
> only follows the physical delete semantics when dropping a dataset.
What do you mean by physical delete? You mean deleting the entire dataset, including the data?
> You mean an async API?
Yes, in general an async API for the server, so operations like DropTable and DropNamespace could return a 202 Accepted with an ID to check progress.
For a catalog, this ID is basically a transaction ID, which means introducing APIs like the following (sketched below):
- GET /transactions/{tid} to check transaction progress
- DELETE /transactions/{tid} to cancel an ongoing transaction
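A minimal OpenAPI sketch of what those endpoints could look like (operation names, schema fields, and status values are assumptions on my side, not an agreed design):

```yaml
# Illustrative sketch only: operation names, schema fields, and status
# values are assumptions, not an agreed design.
/transactions/{tid}:
  parameters:
    - name: tid
      in: path
      required: true
      schema:
        type: string
  get:
    operationId: GetTransaction
    summary: Check the progress of an ongoing transaction
    responses:
      "200":
        description: Current transaction state
        content:
          application/json:
            schema:
              type: object
              properties:
                id:
                  type: string
                status:
                  type: string
                  enum: [ running, succeeded, failed, cancelled ]
                error:
                  type: string
  delete:
    operationId: CancelTransaction
    summary: Cancel an ongoing transaction
    responses:
      "204":
        description: Transaction cancelled
```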
> for deleting a table, the Lance SDK must support async delete logic
What do you mean by that? You mean the Lance SDK deletes the dataset in storage (e.g. from Spark), and calls the catalog to just drop the table reference?
> What do you mean by physical delete? You mean deleting the entire dataset, including the data?
Oh... sorry, that was the wrong description; I meant to say sync delete.
> What do you mean by that? You mean the Lance SDK deletes the dataset in storage (e.g. from Spark), and calls the catalog to just drop the table reference?
I mean that currently Lance only has a sync delete API, which triggers deletion directly via the object store. For deleting a Lance table, is that not enough?
Sorry, somehow missed responding to this thread...
Yeah, but the issue here is that when deletion is moved to the server side, the server is responsible for running this sync delete, and that will likely trigger a timeout on the client side if the API is sync.
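For illustration, the async variant could instead have DropTable return a 202 Accepted with a transaction ID, so the client polls rather than blocking (the response field name here is an assumption):

```yaml
# Hypothetical async response for DropTable: return immediately and let the
# client poll GET /transactions/{tid} for the purge status.
responses:
  "202":
    description: Drop accepted; data purge continues in the background
    content:
      application/json:
        schema:
          type: object
          properties:
            transaction_id:
              type: string
```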
Also added a prototype of the transactions API, in the context of DropNamespace first, to see how that would look: #59
Accidentally deleted the original fork, which closed the PR... I reopened a new one in #63.
close #49