Implement Lance columnar storage for inference results by Copilot · Pull Request #699 · lincc-frameworks/hyrax

Copilot · 2026-02-12T02:24:27Z

Change Description

Replaces batched .npy file storage with Lance columnar format for inference results (infer, test, umap, engine verbs). New writes default to Lance for optimal performance; existing .npy files remain readable via auto-detection.

Solution Description

New classes (src/hyrax/data_sets/result_dataset.py):

ResultDatasetWriter: Stores tensor metadata in the PyArrow schema and validates tensor shape/dtype consistency across batches.
ResultDataset: PyTorch Dataset + HyraxQL getters (get_data, get_object_id)
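The shape/dtype consistency check described for ResultDatasetWriter can be sketched roughly as follows. This is a minimal illustration, not the code in result_dataset.py: the names `TensorSpec` and `BatchConsistencyChecker` are hypothetical, and shapes and dtypes are passed as plain tuples and strings so the sketch needs no third-party packages.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TensorSpec:
    """Per-sample tensor layout recorded from the first batch (hypothetical helper)."""

    shape: Tuple[int, ...]
    dtype: str


class BatchConsistencyChecker:
    """Remembers the first batch's per-sample shape/dtype and rejects mismatches."""

    def __init__(self) -> None:
        self.spec: Optional[TensorSpec] = None

    def check(self, batch_shape: Tuple[int, ...], dtype: str) -> None:
        # The leading dimension is the batch size, which may differ between batches.
        sample_shape = tuple(batch_shape[1:])
        if self.spec is None:
            # First batch: record the layout that later batches must match.
            self.spec = TensorSpec(sample_shape, dtype)
            return
        if sample_shape != self.spec.shape:
            raise ValueError(
                f"Tensor shape {sample_shape} does not match first batch {self.spec.shape}"
            )
        if dtype != self.spec.dtype:
            raise ValueError(
                f"Tensor dtype {dtype} does not match first batch {self.spec.dtype}"
            )


checker = BatchConsistencyChecker()
checker.check((32, 3, 8), "float32")  # first batch defines the layout
checker.check((7, 3, 8), "float32")   # smaller final batch is still accepted
```

Validating on every batch means a mismatch is reported at the write that introduces it, rather than later when the results are read back.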

Factory functions (src/hyrax/data_sets/result_factories.py):

create_results_writer(): Returns Lance writer
load_results_dataset(): Auto-detects Lance vs .npy on disk, returns appropriate reader
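One way the on-disk auto-detection could work is sketched below. The function name `detect_results_format` is hypothetical, and the layout is an assumption: Lance data is taken to live in a `lance_db` subdirectory (a commit in this PR mentions such a directory), and legacy results are taken to be `.npy` files directly inside the results directory. The actual logic in result_factories.py may differ.

```python
from pathlib import Path


def detect_results_format(results_dir) -> str:
    """Return "lance" or "npy" depending on what is found in results_dir."""
    results_dir = Path(results_dir)
    # Assumption: Lance storage is kept in a dedicated `lance_db` subdirectory.
    if (results_dir / "lance_db").is_dir():
        return "lance"
    # Assumption: legacy results are batched .npy files at the top level.
    if any(results_dir.glob("*.npy")):
        return "npy"
    raise FileNotFoundError(f"No Lance or .npy results found in {results_dir}")
```

Checking for Lance first means a directory that holds both formats is read through the newer path.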

Shared utilities (src/hyrax/config_utils.py):

resolve_results_dir(): Centralized results directory resolution logic, now used by both InferenceDataSet and result_factories.py (eliminates code duplication)

Updated verbs:

infer, test, umap, engine: Use factories, call writer.commit() instead of write_index()
lookup, save_to_database: Updated to use load_results_dataset() for format-agnostic reading
pytorch_ignite.create_save_batch_callback(): Uses factory
InferenceDataSetWriter: Added commit() method for API compatibility
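The API-compatibility idea behind adding commit() to the older writer can be illustrated with a small sketch. The classes `LegacyWriter` and `ColumnarWriter` and the helper `finalize` are hypothetical stand-ins, not Hyrax classes; the point is only that calling code can finalize either writer through the same method.

```python
class LegacyWriter:
    """Stand-in for an older writer whose finalizing step is write_index()."""

    def __init__(self) -> None:
        self.index_written = False

    def write_index(self) -> None:
        self.index_written = True

    def commit(self) -> None:
        # Thin alias so callers can finalize every writer the same way.
        self.write_index()


class ColumnarWriter:
    """Stand-in for a newer writer that only exposes commit()."""

    def __init__(self) -> None:
        self.committed = False

    def commit(self) -> None:
        self.committed = True


def finalize(writer) -> None:
    """Caller code no longer needs to know which writer it was given."""
    writer.commit()
```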

Dependencies: Added lancedb, pyarrow, pylance to pyproject.toml

Example usage:

# Writing (automatic in verbs, uses async internally)
writer = create_results_writer(dataset, results_dir)
writer.write_batch(object_ids, tensors)
writer.commit()  # optimization disabled by default for performance

# Reading (auto-detects format)
results = load_results_dataset(config, results_dir)
data = results[0]  # Returns numpy array
obj_id = results.get_object_id(0)  # HyraxQL getter

Testing: 16 unit tests covering batch writes, multi-dim tensors, dtypes, edge cases (NaN/Inf), indexing, iteration, cross-platform compatibility, async/sync modes.

Documentation: Updated lance_design_hyrax.md to reflect actual implementation (Lance-only writes, auto-detection on read, correct getter names).

Future work: Migration script for .npy → Lance conversion, visualize verb support, deprecation of InferenceDataSet.

Code Quality

I have read the Contribution Guide and agree to the Code of Conduct
My code follows the code style of this project
My code builds (or compiles) cleanly without any errors or warnings
My code contains relevant comments and necessary documentation

Original prompt

Start implementation

The user has attached the following file paths as relevant context:

.github/copilot-instructions.md

[Chronological Review: The conversation began with the user asking about the differences between `lance.dataset` and the combination of `lance.connect` and `.open_table`. This was followed by inquiries regarding the support of index types in `table.take_offsets` and the return type of `HyraxDataset`. The user then requested a plan for implementing a design document in a tractable and testable manner, and finally initiated the implementation phase.]

[Intent Mapping:

"Is there a difference between using lance.dataset and the combination of lance.connect and .open_table? Is one preferred?" - User seeks clarification on the usage of Lance library functions.
"Does table.take_offsets support all of these index types?" - User wants to confirm compatibility of index types with HyraxDataset.
"What about returning Torch.tensor? Is that promised by HyraxDataset?" - User is questioning the return type consistency of HyraxDataset.
"Take a look at this design document and develop a plan for implementing it..." - User requests a structured implementation plan for a design document.
"Start implementation" - User indicates readiness to begin the implementation process.]

[Technical Inventory:

lance.dataset: A function for accessing datasets in the Lance library.
lance.connect: A function to establish a connection to a Lance database.
table.take_offsets: A method for retrieving offsets from a table, with concerns about index type support.
HyraxDataset: A dataset class that the user is evaluating for compatibility with Torch tensors.]

[Code Archaeology:

No specific code files or functions were modified in this conversation, but the user is working within the context of /Users/derekjones/code/work/hyrax/specs/lance_db_spec.md. The focus is on understanding and implementing features related to the Lance library and HyraxDataset.]

[Progress Assessment:

Completed Tasks: User has clarified several technical aspects regarding the Lance library and HyraxDataset.
Partially Complete Work: The user has requested a plan for implementation and has initiated the implementation phase, but specific tasks within that phase are not detailed yet.]

[Context Validation: All necessary context for continuing the work on the Lance library and HyraxDataset is captured, including user inquiries and the current file being worked on.]

[Recent Commands Analysis:

Last Agent Commands: The user executed commands related to understanding the Lance library and initiating implementation.
Tool Results Summary: No specific tool results were returned as the conversation focused on inquiries and planning rather than executing code.
Pre-Summary State: The agent was actively working on the implementation of a design document in the context of the Lance library and HyraxDataset when the token budget was exceeded.
Operation Context: The user’s commands were aimed at clarifying technical details and preparing for implementation, aligning with their goal of developing a structured approach to the design document.]

1. Conversation Overview:
   - Primary Objectives:
     1. "Is there a difference between using `lance.dataset` and the combination of `lance.connect` and `.open_table`? Is one preferred?"
     2. "Does `table.take_offsets` support all of these index types?"
     3. "What about returning Torch.tensor? Is that promised by HyraxDataset?"
     4. "Take a look at this design document and develop a plan for implementing it..."
     5. "Start implementation"
   - Session Context: The conversation has revolved around understanding the Lance library's functionalities and preparing for the implementation of a design document related to HyraxDataset.
   - User Intent Evolution: The user transitioned from seeking clarifications on library functions to requesting a structured implementation plan and finally initiating the implementation phase.

Technical Foundation:

lance.dataset: Function for accessing datasets in the Lance library.
lance.connect: Function to establish a connection to a Lance database.
table.take_offsets: Method for retrieving offsets from a table, with concerns about index type support.
HyraxDataset: Dataset class under evaluation for compatibility with Torch tensors.

Codebase Status:

File Name: /Users/derekjones/code/work/hyrax/specs/lance_db_spec.md
Purpose: This file is important for documenting tests and specifications related to the Lance library and HyraxDataset.
Current State: The user is in the process of implementing a design document.
Key Code Segments: Not specified in the conversation.
Dependencies: Related to the Lance library and HyraxDataset functionalities.

Problem Resolution:

Issues Encountered: User is uncertain about the compatibility of index types and return types in the context of Hyr...

Created from VS Code.

* Use dedicated lance_db subdirectory for LanceDB storage --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

* Update lance_design.md terminology to match actual codebase Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Simplify writer factory to always use ResultDatasetWriter without config option Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> --------- Co-authored-by: Derek T. Jones <dtj@mac.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> Co-authored-by: Derek T. Jones <dtj1s@uw.edu> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Address feedback on lines 104-106: clarify that tensor metadata (shape, dtype) is stored in Arrow table schema's custom metadata dictionary, show proper JSON serialization using json.dumps(), and explain the serialization/deserialization process with concrete code examples. Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> --------- Co-authored-by: Derek T. Jones <dtj@mac.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> Co-authored-by: Derek T. Jones <dtj1s@uw.edu> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
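The JSON serialization of tensor metadata that this commit message refers to could look roughly like the sketch below. Arrow schema metadata is a mapping from byte strings to byte strings, so values such as a shape tuple have to be serialized first. The key names `tensor_shape` and `tensor_dtype` and both function names are assumptions for illustration; a plain dict stands in for the schema metadata so the sketch runs without pyarrow installed.

```python
import json
from typing import Dict, Sequence, Tuple


def encode_tensor_metadata(shape: Sequence[int], dtype: str) -> Dict[bytes, bytes]:
    """Serialize shape and dtype into a bytes-to-bytes mapping."""
    return {
        # Hypothetical key names; JSON keeps the shape unambiguous and readable.
        b"tensor_shape": json.dumps(list(shape)).encode("utf-8"),
        b"tensor_dtype": json.dumps(dtype).encode("utf-8"),
    }


def decode_tensor_metadata(metadata: Dict[bytes, bytes]) -> Tuple[Tuple[int, ...], str]:
    """Recover (shape, dtype) from a mapping produced by encode_tensor_metadata."""
    shape = tuple(json.loads(metadata[b"tensor_shape"].decode("utf-8")))
    dtype = json.loads(metadata[b"tensor_dtype"].decode("utf-8"))
    return shape, dtype


meta = encode_tensor_metadata((3, 64, 64), "float32")
shape, dtype = decode_tensor_metadata(meta)
```

With pyarrow available, a mapping like `meta` could be attached to a schema through `Schema.with_metadata` and read back from `schema.metadata`.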

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

- Simplify Path() conversions in result_dataset.py
- Make single-line if statement for return
- Add comment explaining [0] indexing
- Add tensor shape/dtype validation in subsequent batches
- Fix Windows-incompatible directory names in tests
- Fix .numpy() call in umap.py for ResultDataset compatibility
- Update lookup and save_to_database verbs to use load_results_dataset()
- Add commit() method to InferenceDataSetWriter for API compatibility
- Fix misleading docstring in create_results_writer()
- Replace awkward __new__ pattern with standalone helper function
- Update documentation to match actual implementation (get_data vs get_tensor)
- Re-add pylance dependency (required by lancedb for dataset access)

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

github-actions · 2026-02-12T18:46:20Z

| Before [`0aebbf0`] | After [`c36d109`] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 38.5±0.09s | 54.7±0.2s | ~1.42 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'qdrant') |
| 457±6ms | 3.41±0.01s | 7.46 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(64, 'chromadb') |
| 560±1ms | 3.69±0s | 6.59 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(256, 'chromadb') |
| 1.53±0s | 6.11±0.01s | 4.00 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'chromadb') |
| 9.60±0.06s | 25.7±0.1s | 2.67 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'chromadb') |
| 2.18±0.02s | 5.01±0s | 2.30 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(64, 'qdrant') |
| 2.87±0.03s | 5.94±0s | 2.07 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(256, 'qdrant') |
| 6.86±0.01s | 11.4±0s | 1.66 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'qdrant') |
| 1.27G | 1.47G | 1.16 | vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(2048, 'qdrant') |
| 1.03G | 1.19G | 1.15 | vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(2048, 'chromadb') |
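The Ratio column appears to be the After value divided by the Before value once both are in the same unit, so values above 1 indicate a slowdown or higher memory use. The helper names below are hypothetical, and the check is only approximate: recomputing from the rounded figures shown can differ from the reported ratio in the last digit.

```python
def to_seconds(value: float, unit: str) -> float:
    """Convert a timing shown in the table to seconds ("s" or "ms" only)."""
    factors = {"s": 1.0, "ms": 1e-3}
    return value * factors[unit]


def ratio(before_s: float, after_s: float) -> float:
    """After divided by Before; greater than 1 means the benchmark got slower."""
    return after_s / before_s


# chromadb, 64 vectors: 457 ms before, 3.41 s after
chroma_64 = ratio(to_seconds(457, "ms"), to_seconds(3.41, "s"))
print(round(chroma_64, 2))
```

For the chromadb row with 64 vectors this gives about 7.46, matching the reported ratio.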

This reverts commit 0e361fe. Copilot knocked over the kettle.

* Fix test_nan.py to expect numpy arrays from inference results
* Fix umap verb to return numpy arrays and result dataset
* Fix test_test.py to expect ResultDataset instead of InferenceDataSet

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

codecov · 2026-02-12T22:14:55Z

Codecov Report

❌ Patch coverage is 88.54167% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.17%. Comparing base (0aebbf0) to head (f4fe719).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/hyrax/config_utils.py | 63.15% | 7 Missing ⚠️ |
| src/hyrax/verbs/lookup.py | 0.00% | 5 Missing ⚠️ |
| src/hyrax/data_sets/result_dataset.py | 96.29% | 4 Missing ⚠️ |
| src/hyrax/verbs/engine.py | 0.00% | 3 Missing ⚠️ |
| src/hyrax/data_sets/result_factories.py | 90.00% | 2 Missing ⚠️ |
| src/hyrax/data_sets/inference_dataset.py | 75.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #699      +/-   ##
==========================================
+ Coverage   63.25%   64.17%   +0.91%     
==========================================
  Files          59       61       +2     
  Lines        5770     5892     +122     
==========================================
+ Hits         3650     3781     +131     
+ Misses       2120     2111       -9

drewoldag

Overall this is looking good to me, I'm ok clicking the approve button. I'll plan on working with it a bit tomorrow morning.

- Add resolve_results_dir() to config_utils.py as a shared utility
- Remove duplicate _resolve_results_dir() from result_factories.py
- Remove duplicate _resolve_results_dir() from InferenceDataSet
- Both result_factories and InferenceDataSet now use the shared function
- Removes code duplication and centralizes results directory resolution logic

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

gitosaurus · 2026-02-14T00:20:09Z

@copilot The benchmark performance regressions appear to arise from .infer, which suggest that Lance is not fast enough in the current implementation. Use the recommended async/await approach for best speed.

- Implement async Lance operations using connect_async and AsyncTable
- Keep synchronous API for backward compatibility (uses asyncio.run internally)
- Add use_async parameter (default True) to enable/disable async mode
- Remove automatic optimization from commit() (now opt-in via optimize parameter)
- Optimization is expensive and not needed during writes per Lance docs
- Async operations provide better performance for write-heavy workloads

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Copilot · 2026-02-14T00:28:49Z

@copilot The benchmark performance regressions appear to arise from .infer, which suggest that Lance is not fast enough in the current implementation. Use the recommended async/await approach for b...

Implemented async/await support in 5842f93.

The ResultDatasetWriter now uses LanceDB's async operations (connect_async and AsyncTable.add) internally while maintaining a synchronous API for backward compatibility via asyncio.run().

Key changes:

Added use_async parameter (default True) to enable async Lance operations
Removed automatic table optimization from commit() - now opt-in via optimize parameter
Per Lance documentation, optimization is expensive and typically only needed after 100k+ records or 20+ write operations

All 16 unit tests pass with the new async implementation.
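The pattern described in this comment, a synchronous API that drives async operations through asyncio.run(), can be sketched as follows. This is an illustration only: `AsyncBackedWriter` is a hypothetical class, the in-memory list stands in for a real table, and `asyncio.sleep(0)` stands in for the awaitable storage calls. It is not the code from commit 5842f93.

```python
import asyncio
from typing import Any, Dict, List, Sequence


class AsyncBackedWriter:
    """Synchronous facade over async write operations (illustrative only)."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []
        self.optimized = False

    async def _add_async(self, rows: List[Dict[str, Any]]) -> None:
        # Placeholder for an awaitable storage call such as an async table add.
        await asyncio.sleep(0)
        self.rows.extend(rows)

    async def _optimize_async(self) -> None:
        # Placeholder for an expensive compaction/optimization step.
        await asyncio.sleep(0)
        self.optimized = True

    def write_batch(self, object_ids: Sequence[str], values: Sequence[Any]) -> None:
        rows = [{"object_id": oid, "data": v} for oid, v in zip(object_ids, values)]
        # asyncio.run starts an event loop, runs the coroutine, and closes the loop.
        asyncio.run(self._add_async(rows))

    def commit(self, optimize: bool = False) -> None:
        # Optimization is opt-in, mirroring the behaviour described above.
        if optimize:
            asyncio.run(self._optimize_async())


writer = AsyncBackedWriter()
writer.write_batch(["a", "b"], [[1.0, 2.0], [3.0, 4.0]])
writer.commit()
```

One caveat with this design: asyncio.run() raises RuntimeError when called from a thread that already has a running event loop, so a facade like this cannot be used as-is from inside async code or some notebook environments.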

…erformance" This reverts commit 5842f93.

gitosaurus and others added 18 commits February 11, 2026 17:03

Design for integrating LanceDB into Hyrax

5fbd9c5

Address PR comments, add CLAUDE.md

4ab4666

write_index -> commit

3b6fe19

Update lance_design.md

c561d88

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Per interactive discussion with Drew

757ecc1

Update lance_design.md

989ce4a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update lance_design.md

c7c1395

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update lance_design.md

04e5f00

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update lance_design.md

2499131

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update lance_design.md

91f6aa3

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update lance_design.md

cadb17c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Move spec to dedicated directory

3f4e096

Manual proofreading

85c518d

Minor fix

9d75bb8

Copilot AI assigned Copilot and gitosaurus Feb 12, 2026

Copilot started work on behalf of gitosaurus February 12, 2026 02:24

Copilot AI and others added 6 commits February 12, 2026 02:27

Add ResultDataset and ResultDatasetWriter with Lance storage backend

2a500ac

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Fix ResultDataset to use correct Lance API and add pylance dependency

f4d7b64

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Wire ResultDataset writer into infer and engine verbs

4963d5b

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Update umap and test verbs to use ResultDataset factories

7b9c74c

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Fix linting issues in ResultDataset

d8cfcdb

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Address code review feedback: fix type annotation consistency

e026b77

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Start implementation of design document~~ Implement Lance columnar storage for inference results Feb 12, 2026

Copilot AI requested a review from gitosaurus February 12, 2026 02:42

Copilot finished work on behalf of gitosaurus February 12, 2026 02:42

Copilot AI and others added 2 commits February 12, 2026 16:29

Fix linting issues - line length and f-string formatting

db986e3

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

Copilot AI requested a review from gitosaurus February 12, 2026 16:32

Copilot finished work on behalf of gitosaurus February 12, 2026 16:32

Update src/hyrax/verbs/umap.py

c36072e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

gitosaurus and others added 3 commits February 12, 2026 11:37

Fix test_umap.py to handle numpy array results directly

0e361fe

Revert "Fix test_umap.py to handle numpy array results directly"

8a11851

This reverts commit 0e361fe. Copilot knocked over the kettle.

gitosaurus reviewed Feb 12, 2026

src/hyrax/data_sets/result_dataset.py (resolved)

See #683 for the correct lance_db_spec.md

34ea1a7

drewoldag approved these changes Feb 13, 2026

src/hyrax/data_sets/result_factories.py (outdated, resolved)

src/hyrax/data_sets/result_dataset.py (resolved)

Copilot started work on behalf of gitosaurus February 13, 2026 23:57

Copilot finished work on behalf of gitosaurus February 14, 2026 00:05

Copilot AI requested a review from gitosaurus February 14, 2026 00:05

Merge branch 'main' into copilot/vscode-mliu46ak-b0ag

0b58e86

Copilot started work on behalf of gitosaurus February 14, 2026 00:20

Copilot AI changed the title ~~Implement Lance columnar storage for inference results~~ Implement Lance columnar storage for inference results with async/await performance optimization Feb 14, 2026

Copilot finished work on behalf of gitosaurus February 14, 2026 00:29

Revert "Add async/await support to ResultDatasetWriter for improved p…

f4fe719

…erformance" This reverts commit 5842f93.

gitosaurus changed the title ~~Implement Lance columnar storage for inference results with async/await performance optimization~~ Implement Lance columnar storage for inference results Feb 14, 2026

gitosaurus enabled auto-merge (squash) February 14, 2026 01:20

gitosaurus merged commit 2918dd6 into main Feb 14, 2026
7 checks passed

gitosaurus deleted the copilot/vscode-mliu46ak-b0ag branch February 14, 2026 01:23

gitosaurus mentioned this pull request Feb 14, 2026

Investigate the use of lance file format for storing InferenceDataset output #428

Open

gitosaurus merged 35 commits into main from copilot/vscode-mliu46ak-b0ag

3 participants
