Skip to content

Conversation

@yanghua
Copy link
Collaborator

@yanghua yanghua commented Apr 22, 2025

close #49

@yanghua yanghua requested a review from jackye1995 April 22, 2025 14:37
@github-actions github-actions bot added enhancement New feature or request python Python features java Java features spec Restful openapi spec rust Rust features labels Apr 22, 2025
tags: [ Table ]
summary: Drop a table from the catalog
operationId: DropTable
parameters:
Copy link
Collaborator

@jackye1995 jackye1995 Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was browsing through the history of IRC about this because I remember there were quite some controversies about this feature. I think there are 2 big reasons in summary,

(1) there is basically no way for the server to actually purge the data and give a signal back to the client that the purge operation actually succeeded. Because purge can take a long time, but this API has to return right away, even if there is a background job that is purging data, if that process fails half way (which happens a lot due to issues like permission), there is no way to report back to the client that the purge process has failed and some actions need to be taken.

There is a way to solve this issue, which is to introduce a transaction API for progress tracking, so the client can detect any issue with purging through that API. I think this would also be helpful for DropNamespace to allow cascade drop. If you agree with this direction, I can open a PR to add that API.

(2) in Iceberg, there is a property in the table that indicates the same thing called gc.enabled. And users expect that table property to dictate if the table should be purged or not during a drop table and it might get inconsistent with the API request (e.g. the table says the table needs to be purged, but user does not request purge). I don't know if there is a similar situation in Lance that we need to consider here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if there is a similar situation in Lance that we need to consider here.

AFAIK, Lance only follows the physical delete semantics when dropping a dataset.

There is a way to solve this issue, which is to introduce a transaction API for progress tracking, so the client can detect any issue with purging through that API.

You mean an async API?

I mean, there may be two requirements to make the async take effect:

  • in catalog server(adapter service), we support async delete logic(e.g. delete ns as you said)
  • For deleting a table, the lance SDK must support async delete logic

WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only follows the physical delete semantics when dropping a dataset.

what do you mean by physical delete? You mean delete the entire dataset including data?

You mean an async API?

yes in general async API for the server, so operations like DropTable, DropNamespace could return a 202 accepted with an ID to check progress.

For a catalog, this ID is basically a transaction ID, which means to introduce APIs like

  • GET /transactions/{tid} to check for transaction progress
  • DELETE /transactions/{tid} to cancel an ongoing transaction

For deleting a table, the lance SDK must support async delete logic

what do you mean by that? You mean Lance SDK deletes the dataset in storage in Spark, and call the catalog to just drop the table reference?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by physical delete? You mean delete the entire dataset including data?

Oh... sorry, this is a wrong description, I want to say sync delete.

what do you mean by that? You mean Lance SDK deletes the dataset in storage in Spark, and call the catalog to just drop the table reference?

I mean, currently, Lance only has a sync delete API, which only triggers deletion via object store. For deleting a lance table, is it not enough?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry somehow missed responding to this thread...

yeah but the issue here is when the deletion is moved to the server side, the server is responsible for running this sync delete, and that will likely trigger a timeout on client side if the API is sync.

Copy link
Collaborator

@jackye1995 jackye1995 Apr 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added a prototype for the transactions API under the context of DropNamespace first to see how that would look like in #59

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accidentally deleted the original fork which closed the PR... I reopened a new one in #63

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java Java features python Python features rust Rust features spec Restful openapi spec

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spec: drop table under ns

2 participants