Commit d7504c2

feat: add RTree index spec in table format (#5360)
This PR proposes adding the R-Tree index specification to the Lance table format. For implementation details, please see #5034. Feel free to leave comments or share feedback.
1 parent fb7f8dd commit d7504c2

3 files changed

+128
-2
lines changed


docs/src/format/table/index/scalar/.pages

Lines changed: 1 addition & 1 deletion
@@ -7,4 +7,4 @@ nav:
77
- Bloom Filter: bloom_filter.md
88
- Full Text Search: fts.md
99
- N-gram: ngram.md
10-
10+
- RTree: rtree.md
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
1+
# R-Tree Index
2+
3+
The R-Tree index is a static, immutable 2D spatial index. It is built on bounding boxes to organize the data. This index is intended to accelerate rectangle-based pruning.
4+
5+
It is designed as a multi-level hierarchical structure: leaf pages store tuples `(bbox, id=rowid)` for indexed geometries; branch pages aggregate child bounding boxes and store `id=pageid` pointing to child pages; a single root page encloses the entire tree. Conceptually, it can be thought of as an extension of the B+-tree to multidimensional objects, where bounding boxes act as keys for spatial pruning.
6+
7+
The index uses a packed-build strategy where items are first sorted and then grouped into fixed-size leaf pages.
8+
9+
This packed-build flow is:
10+
- Sort items (bboxes) according to the sorting algorithm.
11+
- Pack consecutive items into leaf pages of `page_size` entries; then build parent pages bottom-up by aggregating child page bboxes.
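
The packing step above can be sketched as follows. This is a minimal illustration, not the Lance implementation: the helper names (`union`, `chunk`, `pack`) are hypothetical, bounding boxes are plain `(xmin, ymin, xmax, ymax)` tuples, and page ids are assumed to be assigned sequentially in write order (leaves first).

```python
def union(boxes):
    """Smallest bbox (xmin, ymin, xmax, ymax) enclosing all given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def chunk(entries, page_size):
    """Split a list of entries into consecutive pages of at most page_size."""
    return [entries[i:i + page_size] for i in range(0, len(entries), page_size)]


def pack(sorted_items, page_size):
    """Build all levels bottom-up from already-sorted (bbox, rowid) items.

    Returns a list of levels, leaves first. Each level is a list of pages and
    each page is a list of (bbox, id) entries.
    """
    if page_size < 2:
        raise ValueError("page_size must be at least 2")
    level = chunk(list(sorted_items), page_size)
    levels = [level]
    next_page_id = 0  # assumption: page ids follow write order, leaves first
    while len(level) > 1:
        # One parent entry per child page: the child's aggregate bbox and id.
        entries = []
        for page in level:
            entries.append((union([bbox for bbox, _ in page]), next_page_id))
            next_page_id += 1
        level = chunk(entries, page_size)
        levels.append(level)
    return levels
```

With five items and `page_size = 2`, this yields three leaf pages, two branch pages, and one root page, so the root is the last of six pages.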
12+
13+
## Sorting
14+
15+
Sorting does not change the R-Tree data structure, but it is critical to performance. Currently, Hilbert sorting is implemented, but the design is extensible to other spatial sorting algorithms.
16+
17+
### Hilbert Curve Sorting
18+
19+
Hilbert sorting imposes a linear order on 2D items using a space-filling Hilbert curve to maximize locality in both axes. This improves leaf clustering, which benefits query pruning.
20+
21+
Hilbert sorting is performed in three steps:
22+
23+
1. **Global bounding box**: compute the global bbox `[xmin_g, ymin_g, xmax_g, ymax_g]` over all items used to build the index.
24+
2. **Normalize and compute Hilbert value**:
25+
- For each item bbox `[xmin_i, ymin_i, xmax_i, ymax_i]`, compute its center:
26+
- `cx = (xmin_i + xmax_i) / 2`
27+
- `cy = (ymin_i + ymax_i) / 2`
28+
- Map the center to a 16‑bit grid per axis using the global bbox. Let `W = xmax_g - xmin_g` and `H = ymax_g - ymin_g`. The normalized integer coordinates are:
29+
- `xi = round(((cx - xmin_g) / W) * (2^16 - 1))`
30+
- `yi = round(((cy - ymin_g) / H) * (2^16 - 1))`
31+
- If the global width or height is effectively zero, the corresponding axis is treated as degenerate and set to `0` for all items (the ordering then degenerates to 1D on the other axis).
32+
- For each `(xi, yi)` in `[0 .. 2^16-1] × [0 .. 2^16-1]`, compute a 32‑bit Hilbert value using a standard 2D Hilbert algorithm. In pseudocode (with `bits = 16`):
33+
```
fn hilbert_value(x, y, bits):
    # x, y: integers in [0 .. 2^bits - 1]
    h = 0
    mask = (1 << bits) - 1

    for s from bits-1 down to 0:
        rx = (x >> s) & 1
        ry = (y >> s) & 1
        d = ((3 * rx) XOR ry) << (2 * s)
        h = h | d

        if ry == 0:
            if rx == 1:
                x = (~x) & mask
                y = (~y) & mask
            swap(x, y)

    return h
```
53+
- The resulting `h` is stored as the item’s Hilbert value (type `u32` with `bits = 16`).
54+
3. **Sort**: sort items by Hilbert value.
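
The three steps above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than the reference implementation: `normalize` and `hilbert_sort` are hypothetical names, "effectively zero" is approximated as `extent <= 0`, and Python's `round` (which rounds exact halves to the nearest even integer) stands in for the spec's `round`.

```python
def hilbert_value(x, y, bits=16):
    """Direct transcription of the pseudocode: (x, y) -> Hilbert index."""
    h = 0
    mask = (1 << bits) - 1
    for s in range(bits - 1, -1, -1):
        rx = (x >> s) & 1
        ry = (y >> s) & 1
        h |= ((3 * rx) ^ ry) << (2 * s)
        if ry == 0:
            if rx == 1:
                # Reflect both coordinates within the grid.
                x = ~x & mask
                y = ~y & mask
            x, y = y, x
    return h


def normalize(c, lo, extent, bits=16):
    """Map coordinate c to an integer in [0 .. 2^bits - 1]."""
    if extent <= 0:  # assumption: simple stand-in for "effectively zero"
        return 0
    return round((c - lo) / extent * ((1 << bits) - 1))


def hilbert_sort(items, bits=16):
    """Sort (bbox, rowid) items by the Hilbert value of each bbox center."""
    if not items:
        return []
    # Step 1: global bounding box over all items.
    gx0 = min(b[0] for b, _ in items)
    gy0 = min(b[1] for b, _ in items)
    gx1 = max(b[2] for b, _ in items)
    gy1 = max(b[3] for b, _ in items)
    w, h = gx1 - gx0, gy1 - gy0

    # Step 2: normalize each center and compute its Hilbert value.
    def key(item):
        (x0, y0, x1, y1), _ = item
        xi = normalize((x0 + x1) / 2, gx0, w, bits)
        yi = normalize((y0 + y1) / 2, gy0, h, bits)
        return hilbert_value(xi, yi, bits)

    # Step 3: sort by Hilbert value.
    return sorted(items, key=key)
```

For example, four point-like boxes at the corners of the unit square are ordered `(0,0)`, `(0,1)`, `(1,1)`, `(1,0)`, which follows the first-order Hilbert curve.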
55+
56+
## Index Details
57+
58+
```protobuf
%%% proto.message.RTreeIndexDetails %%%
```
61+
62+
## Storage Layout
63+
64+
The R-Tree index consists of two files:
65+
66+
1. `page_data.lance` - Stores all pages (leaf, branch) as repeated `(bbox, id)` tuples, written bottom-up (leaves first, then branch levels)
67+
2. `nulls.lance` - Stores a serialized RowIdTreeMap of rows with null or invalid geometry
68+
69+
### Page File Schema
70+
71+
| Column | Type     | Nullable | Description |
|:-------|:---------|:---------|:------------|
| `bbox` | RectType | false    | Type is Rect defined by [geoarrow-rs](https://github.com/geoarrow/geoarrow-rs) RectType; physical storage is Struct<xmin: Float64, ymin: Float64, xmax: Float64, ymax: Float64>. Represents the node bounding box (leaf: item bbox; branch: child aggregation). |
| `id`   | UInt64   | false    | Reuse the `id` column to store `rowid` in leaf pages and `pageid` in branch pages |
75+
76+
### Nulls File Schema
77+
78+
| Column  | Type   | Nullable | Description                                                |
|:--------|:-------|:---------|:-----------------------------------------------------------|
| `nulls` | Binary | false    | Serialized RowIdTreeMap of rows with null/invalid geometry |
81+
82+
### Schema Metadata
83+
84+
The following optional keys can be used by implementations and are stored in the schema metadata:
85+
86+
| Key         | Type   | Description                                       |
|:------------|:-------|:--------------------------------------------------|
| `page_size` | String | Maximum number of entries per page                |
| `num_pages` | String | Total number of pages written                     |
| `num_items` | String | Number of non-null leaf items in the index        |
| `bbox`      | String | JSON-serialized global BoundingBox of the dataset |
92+
93+
### Query Traversal
94+
95+
This index serializes the multi-level hierarchical R-Tree structure into a single page file following the schema above. At lookup time, the reader computes each page offset using the algorithm below and reconstructs the hierarchy for traversal.
96+
97+
Offsets are derived from the `num_items` and `page_size` values in the schema metadata as follows:
98+
99+
- Leaf: `leaf_pages = ceil(num_items / page_size)`; leaf `i` has `page_offset = i * page_size`.
100+
- Branch: let `level_offset` be the starting offset of the current level, which equals the total number of entries in all lower levels; let `prev_pages` be the number of pages in the level below, so `level_pages = ceil(prev_pages / page_size)`. Branch page `j` has `page_offset = j * page_size + level_offset`.
101+
- Iterate levels until one page remains; the root is the last page and has `pageid = num_pages - 1`.
102+
- Page lengths: once all page offsets are collected, each `page_len` is the next page's offset minus the page's own offset; for the final page (root), `page_len = page_file_total_rows - page_offset` (where `page_file_total_rows` is the total number of rows in `page_data.lance`).
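
The offset rules above can be sketched as follows. The function names `page_offsets` and `page_lengths` are hypothetical and only illustrate the arithmetic; they are not taken from the Lance codebase.

```python
def page_offsets(num_items, page_size):
    """Row offset of every page in page_data.lance, leaves first, root last."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2")
    offsets = []
    level_offset = 0         # total entries in all lower levels
    level_items = num_items  # entries stored in the current level
    while True:
        level_pages = -(-level_items // page_size)  # integer ceil division
        offsets.extend(level_offset + j * page_size for j in range(level_pages))
        if level_pages <= 1:
            break  # the single remaining page is the root
        level_offset += level_items
        level_items = level_pages  # one parent entry per child page
    return offsets


def page_lengths(offsets, total_rows):
    """Length of each page: next offset minus own offset; root uses total_rows."""
    return [nxt - cur for cur, nxt in zip(offsets, offsets[1:] + [total_rows])]
```

For `num_items = 5` and `page_size = 2`, the offsets are `[0, 2, 4, 5, 7, 8]`: three leaf pages, two branch pages starting at offset 5, and the root at offset 8. With 10 rows in the page file, the page lengths are `[2, 2, 1, 2, 1, 2]`.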
103+
104+
Traversal starts from the root (`pageid = num_pages - 1`):
105+
106+
- If `page_offset < num_items` (leaf), read items `[page_offset .. page_offset + page_len)` and emit candidate `rowid`s matching the query bbox.
107+
- Otherwise (branch), descend into children whose bounding boxes match the query bbox.
108+
- Continue until there are no more pages to visit; the union of emitted `rowid`s forms the candidate set for evaluation.
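
A traversal for an intersection query might look like the sketch below. It assumes the page file has been loaded into an in-memory list `rows` of `(bbox, id)` tuples, that `offsets` and `lens` were computed as described above, and that a branch entry's `pageid` indexes into `offsets` (consistent with the root having `pageid = num_pages - 1`). The names `intersects` and `search` are hypothetical.

```python
def intersects(a, b):
    """True if bboxes a and b (xmin, ymin, xmax, ymax) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(rows, offsets, lens, num_items, query):
    """Return candidate rowids whose bbox intersects the query bbox."""
    if not offsets:
        return []
    candidates = []
    stack = [len(offsets) - 1]  # start at the root, the last page
    while stack:
        page_id = stack.pop()
        off = offsets[page_id]
        is_leaf = off < num_items  # leaf rows occupy [0 .. num_items)
        for bbox, entry_id in rows[off:off + lens[page_id]]:
            if not intersects(bbox, query):
                continue  # prune this entry and everything below it
            if is_leaf:
                candidates.append(entry_id)  # entry_id is a rowid
            else:
                stack.append(entry_id)       # entry_id is a child pageid
    return candidates
```

The result is a candidate set only: a matching bbox does not guarantee that the underlying geometry satisfies the predicate.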
109+
110+
## Accelerated Queries
111+
112+
The R-Tree index accelerates the following query types by returning a candidate set of rows whose bounding boxes match the query. Exact geometry verification must be performed by the execution engine.
113+
114+
| Query Type     | Description                | Operation                                     | Result Type |
|:---------------|:---------------------------|:----------------------------------------------|:------------|
| **Intersects** | `St_Intersects(col, geom)` | Prunes candidates by bbox intersection        | AtMost      |
| **Contains**   | `St_Contains(col, geom)`   | Prunes candidates by bbox containment         | AtMost      |
| **Within**     | `St_Within(col, geom)`     | Prunes candidates by bbox within relation     | AtMost      |
| **Touches**    | `St_Touches(col, geom)`    | Prunes candidates by bbox touch relation      | AtMost      |
| **Crosses**    | `St_Crosses(col, geom)`    | Prunes candidates by bbox crossing relation   | AtMost      |
| **Overlaps**   | `St_Overlaps(col, geom)`   | Prunes candidates by bbox overlap relation    | AtMost      |
| **Covers**     | `St_Covers(col, geom)`     | Prunes candidates by bbox cover relation      | AtMost      |
| **CoveredBy**  | `St_Coveredby(col, geom)`  | Prunes candidates by bbox covered-by relation | AtMost      |
| **IsNull**     | `col IS NULL`              | Returns rows recorded in the nulls file       | Exact       |

protos/index.proto

Lines changed: 3 additions & 1 deletion
@@ -188,4 +188,6 @@ message JsonIndexDetails {
188188
string path = 1;
189189
google.protobuf.Any target_details = 2;
190190
}
191-
message BloomFilterIndexDetails {}
191+
message BloomFilterIndexDetails {}
192+
193+
message RTreeIndexDetails {}
