| 1 | +# R-Tree Index |
| 2 | + |
| 3 | +The R-Tree index is a static, immutable 2D spatial index. It organizes data by bounding box and is intended to accelerate rectangle-based pruning. |
| 4 | + |
| 5 | +It is designed as a multi-level hierarchical structure: leaf pages store tuples `(bbox, id=rowid)` for indexed geometries; branch pages aggregate child bounding boxes and store `id=pageid` pointing to child pages; a single root page encloses the entire tree. Conceptually, it can be thought of as an extension of the B+-tree to multidimensional objects, where bounding boxes act as keys for spatial pruning. |
| 6 | + |
| 7 | +The index uses a packed-build strategy where items are first sorted and then grouped into fixed-size leaf pages. |
| 8 | + |
| 9 | +This packed-build flow is: |
| 10 | +- Sort items (bboxes) according to the sorting algorithm. |
| 11 | +- Pack consecutive items into leaf pages of `page_size` entries; then build parent pages bottom-up by aggregating child page bboxes. |
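A minimal Python sketch of the packed-build flow above, for illustration only. It assumes items arrive already sorted, represents a bbox as an `(xmin, ymin, xmax, ymax)` tuple, and assumes a `pageid` is the page's ordinal in write order; `union` and `pack_pages` are hypothetical names, not part of the spec.

```python
def union(boxes):
    """Smallest bbox (xmin, ymin, xmax, ymax) enclosing all input boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def pack_pages(sorted_items, page_size):
    """Pack pre-sorted (bbox, rowid) items into pages, bottom-up.

    Returns a flat list of pages, each a list of (bbox, id) entries.
    Leaf pages come first, then each branch level, with the root last.
    """
    pages = []
    level = [sorted_items[i:i + page_size]
             for i in range(0, len(sorted_items), page_size)]
    first_id = 0  # assumed pageid of the first page in the current level
    while True:
        pages.extend(level)
        if len(level) <= 1:
            break  # a single page at this level is the root
        # One parent entry per child page: (aggregated bbox, child pageid).
        entries = [(union([b for b, _ in page]), first_id + k)
                   for k, page in enumerate(level)]
        first_id += len(level)
        level = [entries[i:i + page_size]
                 for i in range(0, len(entries), page_size)]
    return pages
```

With five items and `page_size = 2`, this sketch produces three leaf pages, two branch pages, and one root page.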
| 12 | + |
| 13 | +## Sorting |
| 14 | + |
| 15 | +Sorting does not change the R-Tree data structure, but it is critical to performance. Currently, Hilbert sorting is implemented, but the design is extensible to other spatial sorting algorithms. |
| 16 | + |
| 17 | +### Hilbert Curve Sorting |
| 18 | + |
| 19 | +Hilbert sorting imposes a linear order on 2D items using a space-filling Hilbert curve to maximize locality in both axes. This improves leaf clustering, which benefits query pruning. |
| 20 | + |
| 21 | +Hilbert sorting is performed in three steps: |
| 22 | + |
| 23 | +1. **Global bounding box**: compute the global bbox `[xmin_g, ymin_g, xmax_g, ymax_g]` over all items being indexed. |
| 24 | +2. **Normalize and compute Hilbert value**: |
| 25 | + - For each item bbox `[xmin_i, ymin_i, xmax_i, ymax_i]`, compute its center: |
| 26 | + - `cx = (xmin_i + xmax_i) / 2` |
| 27 | + - `cy = (ymin_i + ymax_i) / 2` |
| 28 | + - Map the center to a 16‑bit grid per axis using the global bbox. Let `W = xmax_g - xmin_g` and `H = ymax_g - ymin_g`. The normalized integer coordinates are: |
| 29 | + - `xi = round(((cx - xmin_g) / W) * (2^16 - 1))` |
| 30 | + - `yi = round(((cy - ymin_g) / H) * (2^16 - 1))` |
| 31 | + - If the global width or height is effectively zero, the corresponding axis is treated as degenerate and set to `0` for all items (the ordering then degenerates to 1D on the other axis). |
| 32 | + - For each `(xi, yi)` in `[0 .. 2^16-1] × [0 .. 2^16-1]`, compute a 32‑bit Hilbert value using a standard 2D Hilbert algorithm. In pseudocode (with `bits = 16`): |
| 33 | + ``` |
| 34 | + fn hilbert_value(x, y, bits): |
| 35 | +     # x, y: integers in [0 .. 2^bits - 1] |
| 36 | +     h = 0 |
| 37 | +     mask = (1 << bits) - 1 |
| 38 | + |
| 39 | +     for s from bits-1 down to 0: |
| 40 | +         rx = (x >> s) & 1 |
| 41 | +         ry = (y >> s) & 1 |
| 42 | +         d = ((3 * rx) XOR ry) << (2 * s)  # quadrant index, placed at bit pair s |
| 43 | +         h = h | d |
| 44 | + |
| 45 | +         if ry == 0: |
| 46 | +             if rx == 1: |
| 47 | +                 x = (~x) & mask  # flip within the grid |
| 48 | +                 y = (~y) & mask |
| 49 | +             swap(x, y)  # runs whenever ry == 0, regardless of rx |
| 50 | + |
| 51 | +     return h |
| 52 | + ``` |
| 53 | + - The resulting `h` is stored as the item’s Hilbert value (type `u32` with `bits = 16`). |
| 54 | +3. **Sort**: sort items by Hilbert value. |
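The three steps above can be sketched in Python as follows. This is an illustrative transcription, not the reference implementation: `normalize` and `hilbert_sort` are hypothetical helper names, a degenerate axis is detected with an exact zero check (the text only says "effectively zero"), and Python's `round` rounds halves to even, which may differ from the rounding mode of a real implementation.

```python
def hilbert_value(x, y, bits=16):
    """Transcription of the pseudocode: map grid cell (x, y) to its Hilbert index."""
    h = 0
    mask = (1 << bits) - 1
    for s in range(bits - 1, -1, -1):
        rx = (x >> s) & 1
        ry = (y >> s) & 1
        h |= ((3 * rx) ^ ry) << (2 * s)
        if ry == 0:
            if rx == 1:
                x = ~x & mask
                y = ~y & mask
            x, y = y, x  # swap whenever ry == 0
    return h


def normalize(c, lo, extent, bits=16):
    """Map coordinate c onto [0 .. 2^bits - 1]; a degenerate axis maps to 0."""
    if extent <= 0:  # assumption: exact check stands in for "effectively zero"
        return 0
    return round((c - lo) / extent * ((1 << bits) - 1))


def hilbert_sort(items, bits=16):
    """Sort (bbox, rowid) items by the Hilbert value of each bbox center."""
    if not items:
        return []
    # Step 1: global bounding box over all item bboxes.
    xmin_g = min(b[0] for b, _ in items)
    ymin_g = min(b[1] for b, _ in items)
    xmax_g = max(b[2] for b, _ in items)
    ymax_g = max(b[3] for b, _ in items)
    w, h = xmax_g - xmin_g, ymax_g - ymin_g

    # Step 2: normalize each center and compute its Hilbert value.
    def key(item):
        (xmin, ymin, xmax, ymax), _ = item
        xi = normalize((xmin + xmax) / 2, xmin_g, w, bits)
        yi = normalize((ymin + ymax) / 2, ymin_g, h, bits)
        return hilbert_value(xi, yi, bits)

    # Step 3: sort by Hilbert value.
    return sorted(items, key=key)
```

On a 2×2 grid (`bits = 1`) the cells are visited in the order `(0,0)`, `(0,1)`, `(1,1)`, `(1,0)`, so neighbors along the curve are also neighbors in space.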
| 55 | + |
| 56 | +## Index Details |
| 57 | + |
| 58 | +```protobuf |
| 59 | +%%% proto.message.RTreeIndexDetails %%% |
| 60 | +``` |
| 61 | + |
| 62 | +## Storage Layout |
| 63 | + |
| 64 | +The R-Tree index consists of two files: |
| 65 | + |
| 66 | +1. `page_data.lance` - Stores all pages (leaf and branch) as repeated `(bbox, id)` tuples, written bottom-up (leaves first, then branch levels). |
| 67 | +2. `nulls.lance` - Stores a serialized RowIdTreeMap of rows with null or invalid geometry. |
| 68 | + |
| 69 | +### Page File Schema |
| 70 | + |
| 71 | +| Column | Type | Nullable | Description | |
| 72 | +|:-------|:---------|:---------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 73 | +| `bbox` | RectType | false | Type is Rect defined by [geoarrow-rs](https://github.com/geoarrow/geoarrow-rs) RectType; physical storage is Struct<xmin: Float64, ymin: Float64, xmax: Float64, ymax: Float64>. Represents the node bounding box (leaf: item bbox; branch: child aggregation). | |
| 74 | +| `id` | UInt64 | false | Reuse the `id` column to store `rowid` in leaf pages and `pageid` in branch pages | |
| 75 | + |
| 76 | +### Nulls File Schema |
| 77 | + |
| 78 | +| Column | Type | Nullable | Description | |
| 79 | +|:--------|:-------|:---------|:-----------------------------------------------------------| |
| 80 | +| `nulls` | Binary | false | Serialized RowIdTreeMap of rows with null/invalid geometry | |
| 81 | + |
| 82 | +### Schema Metadata |
| 83 | + |
| 84 | +The following optional keys can be used by implementations and are stored in the schema metadata: |
| 85 | + |
| 86 | +| Key | Type | Description | |
| 87 | +|:------------|:-------|:--------------------------------------------------| |
| 88 | +| `page_size` | String | Maximum number of entries per page                | |
| 89 | +| `num_pages` | String | Total number of pages written | |
| 90 | +| `num_items` | String | Number of non-null leaf items in the index | |
| 91 | +| `bbox` | String | JSON-serialized global BoundingBox of the dataset | |
| 92 | + |
| 93 | +### Query Traversal |
| 94 | + |
| 95 | +The index serializes the multi-level R-Tree hierarchy into a single page file following the schema above. At lookup time, the reader computes each page offset using the algorithm below and reconstructs the hierarchy for traversal. |
| 96 | + |
| 97 | +Offsets are derived from the `num_items` and `page_size` metadata values as follows: |
| 98 | + |
| 99 | +- Leaf: `leaf_pages = ceil(num_items / page_size)`; leaf `i` has `page_offset = i * page_size`. |
| 100 | +- Branch: let `level_offset` be the starting offset of the current level, which equals the total number of entries in all lower levels; let `prev_pages` be the number of pages in the level below; then `level_pages = ceil(prev_pages / page_size)`. For branch `j`, `page_offset = j * page_size + level_offset`. |
| 101 | +- Iterate levels until one page remains; the root is the last page and has `pageid = num_pages - 1`. |
| 102 | +- Page lengths: once all page offsets are collected, compute each `page_len` as the next page's offset minus its own; for the final page (root), `page_len = page_file_total_rows - page_offset` (where `page_file_total_rows` is the total number of rows in `page_data.lance`). |
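The offset rules above can be sketched as follows. The function names `page_offsets` and `page_lengths` are hypothetical, and the sketch assumes `num_items` and `page_size` have already been parsed from their string metadata values into integers.

```python
def page_offsets(num_items, page_size):
    """Starting row offset of every page: leaves first, root last."""
    offsets = []
    level_offset = 0           # total entries in all lower levels
    level_entries = num_items  # entries stored at the current level
    while True:
        # Integer ceil(level_entries / page_size).
        level_pages = (level_entries + page_size - 1) // page_size
        offsets.extend(level_offset + j * page_size for j in range(level_pages))
        if level_pages <= 1:
            break  # one page left at this level: it is the root
        level_offset += level_entries
        level_entries = level_pages  # one entry per page of the level below
    return offsets


def page_lengths(offsets, page_file_total_rows):
    """Entry count of each page: next offset minus own offset."""
    ends = offsets[1:] + [page_file_total_rows]
    return [end - start for start, end in zip(offsets, ends)]
```

For `num_items = 5` and `page_size = 2`, the leaf level holds 5 rows, the first branch level 3, and the root 2, so the page file has 10 rows and the offsets are `[0, 2, 4, 5, 7, 8]`.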
| 103 | + |
| 104 | +Traversal starts from the root (`pageid = num_pages - 1`): |
| 105 | + |
| 106 | +- If `page_offset < num_items` (leaf), read items `[page_offset .. page_offset + page_len)` and emit candidate `rowid`s matching the query bbox. |
| 107 | +- Otherwise (branch), descend into children whose bounding boxes match the query bbox. |
| 108 | +- Continue until there are no more pages to visit; the union of emitted `rowid`s forms the candidate set for evaluation. |
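A rough sketch of the traversal for an intersection query, operating on an in-memory copy of the page file. It assumes `rows` is the flat list of `(bbox, id)` entries in file order, that `offsets` and `lengths` are indexed by `pageid` (taken here to be the page's ordinal in write order), and that bboxes are `(xmin, ymin, xmax, ymax)` tuples; `intersects` and `search` are hypothetical names.

```python
def intersects(a, b):
    """True if bboxes a and b (xmin, ymin, xmax, ymax) share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(rows, offsets, lengths, num_items, query):
    """Collect candidate rowids whose bbox intersects the query bbox."""
    if not offsets:
        return []
    candidates = []
    stack = [len(offsets) - 1]  # start at the root, the last page
    while stack:
        pageid = stack.pop()
        start = offsets[pageid]
        is_leaf = start < num_items  # leaf rows occupy the first num_items rows
        for bbox, ident in rows[start:start + lengths[pageid]]:
            if not intersects(bbox, query):
                continue  # prune this entry (and its subtree, if a branch)
            if is_leaf:
                candidates.append(ident)  # ident is a rowid
            else:
                stack.append(ident)       # ident is a child pageid
    return candidates
```

The returned rowids are candidates only; the exact geometry predicate still has to be evaluated on them.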
| 109 | + |
| 110 | +## Accelerated Queries |
| 111 | + |
| 112 | +The R-Tree index accelerates the following query types by returning a candidate set of rows whose bounding boxes pass the corresponding bbox test. Exact geometry verification must be performed by the execution engine. |
| 113 | + |
| 114 | +| Query Type | Description | Operation | Result Type | |
| 115 | +|:---------------|:---------------------------|:----------------------------------------------|:------------| |
| 116 | +| **Intersects** | `St_Intersects(col, geom)` | Prunes candidates by bbox intersection | AtMost | |
| 117 | +| **Contains** | `St_Contains(col, geom)` | Prunes candidates by bbox containment | AtMost | |
| 118 | +| **Within** | `St_Within(col, geom)` | Prunes candidates by bbox within relation | AtMost | |
| 119 | +| **Touches** | `St_Touches(col, geom)` | Prunes candidates by bbox touch relation | AtMost | |
| 120 | +| **Crosses** | `St_Crosses(col, geom)` | Prunes candidates by bbox crossing relation | AtMost | |
| 121 | +| **Overlaps** | `St_Overlaps(col, geom)` | Prunes candidates by bbox overlap relation | AtMost | |
| 122 | +| **Covers** | `St_Covers(col, geom)` | Prunes candidates by bbox cover relation | AtMost | |
| 123 | +| **CoveredBy** | `St_Coveredby(col, geom)` | Prunes candidates by bbox covered-by relation | AtMost | |
| 124 | +| **IsNull** | `col IS NULL` | Returns rows recorded in the nulls file | Exact | |