From b36f654624021ed43910f8fd361840a3f2999787 Mon Sep 17 00:00:00 2001
From: prrao87
Date: Fri, 2 Jan 2026 09:56:53 -0500
Subject: [PATCH 1/2] Update docs to latest version 0.4.1

---
 docs/src/integrations/duckdb.md | 405 +++++++++++++++++++++++---------
 1 file changed, 293 insertions(+), 112 deletions(-)

diff --git a/docs/src/integrations/duckdb.md b/docs/src/integrations/duckdb.md
index febf71d37ac..d7387052851 100644
--- a/docs/src/integrations/duckdb.md
+++ b/docs/src/integrations/duckdb.md
@@ -4,78 +4,132 @@ Lance datasets can be queried in SQL with [DuckDB](https://duckdb.org/), an in-process OLAP relational database. Using DuckDB means you can write complex SQL queries (that may not yet be supported in Lance), without needing to move your data out of Lance.

 !!! note
-    This integration is done via a DuckDB extension, whose source code is available
+    This integration is done via a DuckDB extension, whose source code and latest documentation (via `README.md`) is available
     [here](https://github.com/lance-format/lance-duckdb).
-    To ensure you see the latest examples and syntax, check out the
-    [DuckDB extension](https://duckdb.org/community_extensions/extensions/lance)
+    To ensure you see the most up-to-date examples and syntax, check out the repo and the
+    [DuckDB extension](https://duckdb.org/community_extensions/extensions/lance)
     documentation page.

-## Usage: Python
+## Installation

-### Install dependencies
+### Python dependencies

-Install Lance, DuckDB and Pyarrow and follow the examples below.
+- To use DuckDB's CLI, install it using the steps shown in [their docs](https://duckdb.org/install/).
+- To run the code in Python, install Lance, DuckDB and PyArrow as shown below.

 ```bash
 pip install pylance duckdb pyarrow
 ```

-### Add data to a Lance dataset
+### Install the Lance extension in DuckDB

-Let's add some data to a Lance dataset.
+We're now ready to begin querying Lance using DuckDB! First, install the extension.

-```python
-import lance
-import pyarrow as pa
+=== "SQL"

-data = [
-    {"animal": "duck", "noise": "quack", "vector": [0.9, 0.7, 0.1]},
-    {"animal": "horse", "noise": "neigh", "vector": [0.3, 0.1, 0.5]},
-    {"animal": "dragon", "noise": "roar", "vector": [0.5, 0.2, 0.7]},
-]
-pa_table = pa.Table.from_pylist(data)
+    ```sql
+    INSTALL lance FROM community;
+    LOAD lance;
+    ```

-lance_path = "./lance_duck.lance"
-ds = lance.write_dataset(pa_table, lance_path, mode="overwrite")
-```
+=== "Python"

-This will store the Lance dataset to the specified local path.
+    ```python
+    import duckdb

-### Install Lance extension in DuckDB
+    duckdb.sql(
+        """
+        INSTALL lance FROM community;
+        LOAD lance;
+        """
+    )
+    ```

-Install the Lance extension in DuckDB as follows.
+???+ info "Update extensions"
+    If you already have the extension installed locally, run the following command to update it to the
+    latest version:
+    ```
+    UPDATE EXTENSIONS;
+    ```

-```python
-import duckdb
+## Examples

-duckdb.sql(
-    """
-    INSTALL lance FROM community;
-    LOAD lance;
-    """
-)
-```
+All examples below reuse a small dataset with three rows (duck, horse, dragon)
+and a `vector` column with representative values. In the real world, you'd have
+a high-dimensional array generated by an embedding model, and a much larger Lance dataset.
+
+### Write a DuckDB table as a Lance dataset
+
+Use DuckDB's `COPY ... TO ...` to materialize query results as a Lance dataset.
+
+=== "SQL"
+
+    ```sql
+    COPY (
+      SELECT *
+      FROM (
+        VALUES
+          ('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
+          ('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
+          ('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
+      ) AS t(animal, noise, vector)
+    ) TO './lance_duck.lance' (FORMAT lance, mode 'overwrite');
+    ```
+
+=== "Python"
+
+    ```python
+    import duckdb
+
+    duckdb.sql(
+        """
+        COPY (
+          SELECT *
+          FROM (
+            VALUES
+              ('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
+              ('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
+              ('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
+          ) AS t(animal, noise, vector)
+        ) TO './lance_duck.lance' (FORMAT lance, mode 'overwrite');
+        """
+    )
+    ```
+
+### Query a Lance dataset from DuckDB

-### Query a `*.lance` path
+Now that the Lance dataset is written, let's query it using SQL in DuckDB.

-You're now ready to query the Lance dataset using SQL!
+=== "SQL"

-```python
-# Get results from Lance in DuckDB!
-r1 = duckdb.sql(
-    """
+    ```sql
     SELECT *
-    FROM './lance_duck.lance'
-    LIMIT 5;
-    """
-)
-print(r1)
-```
+      FROM './lance_duck.lance'
+      LIMIT 5;
+    ```
+
+=== "Python"
+
+    ```python
+    import duckdb
+
+    r1 = duckdb.sql(
+        """
+        SELECT *
+          FROM './lance_duck.lance'
+          LIMIT 5;
+        """
+    )
+    print(r1)
+    ```
+
+
 This returns:
+
 ```
 ┌─────────┬─────────┬─────────────────┐
 │ animal  │  noise  │     vector      │
-│ varchar │ varchar │    double[]     │
+│ varchar │ varchar │     float[]     │
 ├─────────┼─────────┼─────────────────┤
 │ duck    │ quack   │ [0.9, 0.7, 0.1] │
 │ horse   │ neigh   │ [0.3, 0.1, 0.5] │
@@ -84,133 +138,260 @@ This returns:
 ```

 ???+ info "Query S3 paths directly"
-    You can also query `s3://` paths directly. To do this, you can use DuckDB's native secrets mechanism to provide credentials.
+    To access object store URIs (such as `s3://...`), configure a `TYPE LANCE` secret.

     ```sql
-    r1 = duckdb.sql(
-        """
-        CREATE SECRET (TYPE S3, provider credential_chain);
+    CREATE SECRET (
+      TYPE LANCE,
+      PROVIDER credential_chain,
+      SCOPE 's3://bucket/'
+    );

-        SELECT *
-        FROM 's3://bucket/path/to/dataset.lance'
-        LIMIT 5;
+    SELECT *
+      FROM 's3://bucket/path/to/dataset.lance'
+      LIMIT 5;
+    ```
+
+### Create a Lance dataset via CREATE TABLE (directory namespace)
+
+When you `ATTACH` a directory as a Lance namespace, you can create new datasets
+using `CREATE TABLE` or `CREATE TABLE AS SELECT`. The dataset is written to
+`/.lance`.
+
+=== "SQL"
+
+    ```sql
+    ATTACH './lance_ns' AS lance_ns (TYPE LANCE);
+
+    CREATE TABLE lance_ns.main.duck_animals AS
+      SELECT *
+      FROM (
+        VALUES
+          ('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
+          ('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
+          ('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
+      ) AS t(animal, noise, vector);
+    ```
+
+=== "Python"
+
+    ```python
+    import duckdb
+
+    duckdb.sql(
+        """
+        ATTACH './lance_ns' AS lance_ns (TYPE LANCE);
+
+        CREATE TABLE lance_ns.main.duck_animals AS
+          SELECT *
+          FROM (
+            VALUES
+              ('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
+              ('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
+              ('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
+          ) AS t(animal, noise, vector);
         """
     )
     ```

-### Search
+You can then query the namespace as follows:

-The extension exposes lance_search(...) as a unified entry point for:
-
-- Vector search (KNN / ANN)
-- Full-text search (FTS)
-- Hybrid search (vector + FTS)
+```sql
+SELECT count(*) FROM lance_ns.main.duck_animals;
+```

-!!! warning
-    DuckDB treats `column` as a keyword in some contexts. It's recommended to
-    use `text_column` / `vector_column` as column names for the Lance extension.
+```
+┌──────────────┐
+│ count_star() │
+│    int64     │
+├──────────────┤
+│            3 │
+└──────────────┘
+```

-#### Vector search
+### Vector search

 You can perform vector search on a column. This returns the `_distance`
-(smaller is closer, so sort in ascending order for nearest neighbors).
+(smaller is closer, so sort in ascending order for nearest neighbors). The example vector here is similar to the query "duck".

-```python
-# Show results similar to "the duck goes quack"
-q2 = [0.8, 0.7, 0.2]
+=== "SQL"

-r2 = duckdb.sql(
-    """
-    SELECT animal, noise, vector
-    FROM lance_vector_search(
+    ```sql
+    SELECT animal, noise, vector, _distance
+      FROM lance_vector_search(
         './lance_duck.lance',
         'vector',
-        q2::FLOAT[],
+        [0.8, 0.7, 0.2]::FLOAT[],
         k = 1,
         prefilter = true
+      )
+      ORDER BY _distance ASC;
+    ```
+
+=== "Python"
+
+    ```python
+    import duckdb
+
+    r2 = duckdb.sql(
+        """
+        SELECT animal, noise, vector, _distance
+          FROM lance_vector_search(
+            './lance_duck.lance',
+            'vector',
+            [0.8, 0.7, 0.2]::FLOAT[],
+            k = 1,
+            prefilter = true
+          )
+          ORDER BY _distance ASC;
+        """
     )
-    ORDER BY _distance ASC;
-    """
-)
-print(r2)
-```
+    print(r2)
+    ```
+
 This returns:
 ```
 ┌─────────┬─────────┬─────────────────┐
 │ animal  │  noise  │     vector      │
-│ varchar │ varchar │    double[]     │
+│ varchar │ varchar │     float[]     │
 ├─────────┼─────────┼─────────────────┤
 │ duck    │ quack   │ [0.9, 0.7, 0.1] │
 └─────────┴─────────┴─────────────────┘
 ```

-#### Full-text search (FTS)
+### Full-text search

 Run keyword-based BM25 search as shown below. This returns a `_score`, which is sorted in descending order to get the most relevant results.

-```python
-# Show results for the query "the brave knight faced the dragon"
-r3 = duckdb.sql(
-    """
-    SELECT animal, noise, vector
-    FROM lance_fts(
+=== "SQL"
+
+    ```sql
+    SELECT animal, noise, vector, _score
+      FROM lance_fts(
         './lance_duck.lance',
         'animal',
         'the brave knight faced the dragon',
         k = 1,
-        prefilter = true)
-    ORDER BY _score DESC;
-    """
-)
-print(r3)
-```
+        prefilter = true
+      )
+      ORDER BY _score DESC;
+    ```
+
+=== "Python"
+
+    ```python
+    import duckdb
+
+    r3 = duckdb.sql(
+        """
+        SELECT animal, noise, vector, _score
+          FROM lance_fts(
+            './lance_duck.lance',
+            'animal',
+            'the brave knight faced the dragon',
+            k = 1,
+            prefilter = true
+          )
+          ORDER BY _score DESC;
+        """
+    )
+    print(r3)
+    ```
+
 This returns:
+
 ```
 ┌─────────┬─────────┬─────────────────┐
 │ animal  │  noise  │     vector      │
-│ varchar │ varchar │    double[]     │
+│ varchar │ varchar │     float[]     │
 ├─────────┼─────────┼─────────────────┤
 │ dragon  │ roar    │ [0.5, 0.2, 0.7] │
 └─────────┴─────────┴─────────────────┘
 ```

-#### Hybrid search
+### Hybrid search

 Hybrid search combines vector and FTS scores, returning a `_hybrid_score` in addition to `_distance` / `_score`. To get the most relevant results, sort in descending order.

-```python
-# Show results similar to "the duck surprised the dragon"
-q4 = [0.8, 0.7, 0.2]
+=== "SQL"

-r4 = duckdb.sql(
-    """
-    SELECT animal, noise, vector
-    FROM lance_hybrid_search(
+    ```sql
+    SELECT animal, noise, vector, _hybrid_score, _distance, _score
+      FROM lance_hybrid_search(
         './lance_duck.lance',
-        'vector', q4::FLOAT[],
-        'animal', 'the duck surprised the dragon',
+        'vector',
+        [0.8, 0.7, 0.2]::FLOAT[],
+        'animal',
+        'the duck surprised the dragon',
         k = 2,
-        prefilter = true
+        prefilter = false,
+        alpha = 0.5,
+        oversample_factor = 4
+      )
+      ORDER BY _hybrid_score DESC;
+    ```
+
+=== "Python"
+
+    ```python
+    import duckdb
+
+    r4 = duckdb.sql(
+        """
+        SELECT animal, noise, vector, _hybrid_score, _distance, _score
+          FROM lance_hybrid_search(
+            './lance_duck.lance',
+            'vector',
+            [0.8, 0.7, 0.2]::FLOAT[],
+            'animal',
+            'the duck surprised the dragon',
+            k = 2,
+            prefilter = false,
+            alpha = 0.5,
+            oversample_factor = 4
+          )
+          ORDER BY _hybrid_score DESC;
+        """
     )
-    ORDER BY _hybrid_score DESC;
-    """
-)
-print(r4)
-```
-This should give:
+    print(r4)
+    ```
+
+This returns:
 ```
 ┌─────────┬─────────┬─────────────────┐
 │ animal  │  noise  │     vector      │
-│ varchar │ varchar │    double[]     │
+│ varchar │ varchar │     float[]     │
 ├─────────┼─────────┼─────────────────┤
 │ duck    │ quack   │ [0.9, 0.7, 0.1] │
 │ dragon  │ roar    │ [0.5, 0.2, 0.7] │
 └─────────┴─────────┴─────────────────┘
 ```

-## Usage: DuckDB CLI
+!!! warning
+    DuckDB treats `column` as a keyword in some contexts. It's recommended to
+    use `text_column` / `vector_column` as column names for the Lance extension.
+
+## Source repo
+
+Check out the [lance-duckdb](https://github.com/lance-format/lance-duckdb) project
+for the latest source code, and go through `README.md` for the latest API docs.
+Additional pages are listed below.
+
+### Full SQL reference
+
+[sql.md](https://github.com/lance-format/lance-duckdb/blob/main/docs/sql.md)
+lists the current SQL surface supported by this extension. It's recommended to refer
+to this page for the most up-to-date information.
+
+### Cloud storage reference
+
+[cloud.md](https://github.com/lance-format/lance-duckdb/blob/main/docs/cloud.md) lists
+the current supported backends that allow you to access data on various cloud providers.

-DuckDB comes with a CLI that makes it easy to run SQL queries in the terminal.
-Check out the [DuckDB extension](https://duckdb.org/community_extensions/extensions/lance) documentation page for examples using the DuckDB CLI.
+- S3 / S3-compatible: `s3://...` (also accepts `s3a://...` and `s3n://...`, normalized to `s3://...`)
+- Google Cloud Storage: `gs://...`
+- Azure Blob Storage: `az://...`
+- Alibaba Cloud OSS: `oss://...`
+- Hugging Face Hub (OpenDAL): `hf://...`
\ No newline at end of file

From 04dc41f70b403aecfe9e64ed6eaea83e7a72955a Mon Sep 17 00:00:00 2001
From: prrao87
Date: Fri, 2 Jan 2026 10:21:12 -0500
Subject: [PATCH 2/2] Add trailing newline

---
 docs/src/integrations/duckdb.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/integrations/duckdb.md b/docs/src/integrations/duckdb.md
index d7387052851..cac9b048bea 100644
--- a/docs/src/integrations/duckdb.md
+++ b/docs/src/integrations/duckdb.md
@@ -394,4 +394,4 @@ the current supported backends that allow you to access data on various cloud pr
 - Google Cloud Storage: `gs://...`
 - Azure Blob Storage: `az://...`
 - Alibaba Cloud OSS: `oss://...`
-- Hugging Face Hub (OpenDAL): `hf://...`
\ No newline at end of file
+- Hugging Face Hub (OpenDAL): `hf://...`
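
Reviewer note: the vector search example in PATCH 1/2 reports `duck` as the nearest neighbor of `[0.8, 0.7, 0.2]`. The sketch below re-derives that ordering in plain Python, without DuckDB or Lance. It assumes plain Euclidean (L2) distance; the patch only says that a smaller `_distance` is closer, so the extension's actual metric and scaling may differ. The names `rows`, `query`, `l2_distance`, and `ranked` are illustrative and do not come from the patch.

```python
import math

# The three example rows written to ./lance_duck.lance in the patch.
rows = {
    "duck": [0.9, 0.7, 0.1],
    "horse": [0.3, 0.1, 0.5],
    "dragon": [0.5, 0.2, 0.7],
}

# The query vector used in the vector search example.
query = [0.8, 0.7, 0.2]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors.

    Assumption: this stands in for the extension's `_distance`; the real
    metric is not stated in the patch.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Sort ascending, since a smaller distance means a closer match.
ranked = sorted(rows, key=lambda name: l2_distance(rows[name], query))
print(ranked)
```

Under this assumption the ordering is duck, then dragon, then horse, which agrees with the `k = 1` result shown in the patch.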