Implement Lance columnar storage for inference results#699

Merged
gitosaurus merged 35 commits into main from copilot/vscode-mliu46ak-b0ag
Feb 14, 2026
Conversation

Contributor

Copilot AI commented Feb 12, 2026

Change Description

Replaces batched .npy file storage with Lance columnar format for inference results (infer, test, umap, engine verbs). New writes default to Lance for optimal performance; existing .npy files remain readable via auto-detection.

Solution Description

New classes (src/hyrax/data_sets/result_dataset.py):

  • ResultDatasetWriter: Stores tensor metadata (shape, dtype) in the PyArrow schema and validates that tensor shape and dtype stay consistent across batches.
  • ResultDataset: PyTorch Dataset + HyraxQL getters (get_data, get_object_id)
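
The cross-batch consistency check mentioned for ResultDatasetWriter could look roughly like the sketch below. This is not the code from this PR: `TensorSpec` and `BatchValidator` are hypothetical names, and a small descriptor stands in for real tensors so the example runs without NumPy or Lance.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TensorSpec:
    """Per-item tensor shape and dtype name (hypothetical helper)."""

    shape: Tuple[int, ...]
    dtype: str


class BatchValidator:
    """Remember the spec of the first batch and reject batches that differ."""

    def __init__(self) -> None:
        self._expected: Optional[TensorSpec] = None

    def check(self, spec: TensorSpec) -> None:
        if self._expected is None:
            # First batch: record what every later batch must match.
            self._expected = spec
            return
        if spec != self._expected:
            raise ValueError(
                f"batch spec {spec} does not match first batch {self._expected}"
            )


validator = BatchValidator()
validator.check(TensorSpec(shape=(3, 32, 32), dtype="float32"))  # sets the spec
validator.check(TensorSpec(shape=(3, 32, 32), dtype="float32"))  # same spec passes

try:
    validator.check(TensorSpec(shape=(3, 64, 64), dtype="float32"))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Failing on the first mismatching batch keeps a single fixed-shape column per table, which is what a columnar schema needs.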

Factory functions (src/hyrax/data_sets/result_factories.py):

  • create_results_writer(): Returns Lance writer
  • load_results_dataset(): Auto-detects Lance vs .npy on disk, returns appropriate reader
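
A minimal sketch of how on-disk format detection might work, assuming Lance data sits in a `lance_db` subdirectory of the results directory (the name used in one of the commits in this PR) and legacy results are `.npy` files directly inside it. `detect_results_format` is a hypothetical helper, not the PR's `load_results_dataset()`.

```python
import tempfile
from pathlib import Path


def detect_results_format(results_dir: Path) -> str:
    """Return "lance" or "npy" depending on what results_dir contains.

    Assumed layout: Lance data in a `lance_db` subdirectory, legacy
    results as batched `.npy` files directly in results_dir.
    """
    results_dir = Path(results_dir)
    if (results_dir / "lance_db").is_dir():
        return "lance"
    if any(results_dir.glob("*.npy")):
        return "npy"
    raise FileNotFoundError(f"No Lance or .npy results found in {results_dir}")


# Demonstrate both layouts in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    legacy_dir = Path(tmp) / "legacy"
    legacy_dir.mkdir()
    (legacy_dir / "batch_0.npy").touch()
    legacy_format = detect_results_format(legacy_dir)

    lance_dir = Path(tmp) / "new"
    (lance_dir / "lance_db").mkdir(parents=True)
    lance_format = detect_results_format(lance_dir)
```

Checking for Lance first means a directory holding both formats resolves to the newer one; whether the PR makes the same choice is not stated here.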

Shared utilities (src/hyrax/config_utils.py):

  • resolve_results_dir(): Centralized results directory resolution logic, now used by both InferenceDataSet and result_factories.py (eliminates code duplication)

Updated verbs:

  • infer, test, umap, engine: Use factories, call writer.commit() instead of write_index()
  • lookup, save_to_database: Updated to use load_results_dataset() for format-agnostic reading
  • pytorch_ignite.create_save_batch_callback(): Uses factory
  • InferenceDataSetWriter: Added commit() method for API compatibility
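
The API-compatibility idea can be illustrated as follows: callers depend only on `write_batch()` and `commit()`, and an older writer satisfies that by forwarding `commit()` to its existing finalization step. The class names here are hypothetical, and the assumption that the legacy `commit()` simply delegates to `write_index()` is not confirmed by this description.

```python
from typing import List, Protocol, Sequence, Tuple


class ResultsWriter(Protocol):
    """Interface the verbs rely on (hypothetical name)."""

    def write_batch(self, object_ids: Sequence[str], tensors: Sequence[list]) -> None: ...

    def commit(self) -> None: ...


class LegacyWriter:
    """Stand-in for an older writer whose finalization step is write_index()."""

    def __init__(self) -> None:
        self.batches: List[Tuple[list, list]] = []
        self.index_written = False

    def write_batch(self, object_ids: Sequence[str], tensors: Sequence[list]) -> None:
        self.batches.append((list(object_ids), list(tensors)))

    def write_index(self) -> None:
        self.index_written = True

    def commit(self) -> None:
        # Compatibility shim: callers use commit(); the legacy step still runs.
        self.write_index()


def run_verb(writer: ResultsWriter) -> None:
    """Caller code that works with any writer exposing the shared interface."""
    writer.write_batch(["a", "b"], [[0.1], [0.2]])
    writer.commit()


legacy = LegacyWriter()
run_verb(legacy)
```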

Dependencies: Added lancedb, pyarrow, pylance to pyproject.toml

Example usage:

# Import path follows the module listed under "Factory functions"
from hyrax.data_sets.result_factories import create_results_writer, load_results_dataset

# Writing (automatic in verbs, uses async internally)
writer = create_results_writer(dataset, results_dir)
writer.write_batch(object_ids, tensors)
writer.commit()  # table optimization is opt-in (off by default) to keep writes fast

# Reading (auto-detects format)
results = load_results_dataset(config, results_dir)
data = results[0]  # Returns numpy array
obj_id = results.get_object_id(0)  # HyraxQL getter

Testing: 16 unit tests covering batch writes, multi-dim tensors, dtypes, edge cases (NaN/Inf), indexing, iteration, cross-platform compatibility, async/sync modes.

Documentation: Updated lance_design_hyrax.md to reflect actual implementation (Lance-only writes, auto-detection on read, correct getter names).

Future work: Migration script for .npy → Lance conversion, visualize verb support, deprecation of InferenceDataSet.

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation
Original prompt

Start implementation

The user has attached the following file paths as relevant context:

  • .github/copilot-instructions.md
[Chronological Review: The conversation began with the user asking about the differences between `lance.dataset` and the combination of `lance.connect` and `.open_table`. This was followed by inquiries regarding the support of index types in `table.take_offsets` and the return type of `HyraxDataset`. The user then requested a plan for implementing a design document in a tractable and testable manner, and finally initiated the implementation phase.]

[Intent Mapping:

  1. "Is there a difference between using lance.dataset and the combination of lance.connect and .open_table? Is one preferred?" - User seeks clarification on the usage of Lance library functions.
  2. "Does table.take_offsets support all of these index types?" - User wants to confirm compatibility of index types with HyraxDataset.
  3. "What about returning Torch.tensor? Is that promised by HyraxDataset?" - User is questioning the return type consistency of HyraxDataset.
  4. "Take a look at this design document and develop a plan for implementing it..." - User requests a structured implementation plan for a design document.
  5. "Start implementation" - User indicates readiness to begin the implementation process.]

[Technical Inventory:

  • lance.dataset: A function for accessing datasets in the Lance library.
  • lance.connect: A function to establish a connection to a Lance database.
  • table.take_offsets: A method for retrieving offsets from a table, with concerns about index type support.
  • HyraxDataset: A dataset class that the user is evaluating for compatibility with Torch tensors.]

[Code Archaeology:

  • No specific code files or functions were modified in this conversation, but the user is working within the context of /Users/derekjones/code/work/hyrax/specs/lance_db_spec.md. The focus is on understanding and implementing features related to the Lance library and HyraxDataset.]

[Progress Assessment:

  • Completed Tasks: User has clarified several technical aspects regarding the Lance library and HyraxDataset.
  • Partially Complete Work: The user has requested a plan for implementation and has initiated the implementation phase, but specific tasks within that phase are not detailed yet.]

[Context Validation: All necessary context for continuing the work on the Lance library and HyraxDataset is captured, including user inquiries and the current file being worked on.]

[Recent Commands Analysis:

  • Last Agent Commands: The user executed commands related to understanding the Lance library and initiating implementation.
  • Tool Results Summary: No specific tool results were returned as the conversation focused on inquiries and planning rather than executing code.
  • Pre-Summary State: The agent was actively working on the implementation of a design document in the context of the Lance library and HyraxDataset when the token budget was exceeded.
  • Operation Context: The user’s commands were aimed at clarifying technical details and preparing for implementation, aligning with their goal of developing a structured approach to the design document.]
1. Conversation Overview:
  • Primary Objectives:
    1. "Is there a difference between using `lance.dataset` and the combination of `lance.connect` and `.open_table`? Is one preferred?"
    2. "Does `table.take_offsets` support all of these index types?"
    3. "What about returning Torch.tensor? Is that promised by HyraxDataset?"
    4. "Take a look at this design document and develop a plan for implementing it..."
    5. "Start implementation"
  • Session Context: The conversation has revolved around understanding the Lance library's functionalities and preparing for the implementation of a design document related to HyraxDataset.
  • User Intent Evolution: The user transitioned from seeking clarifications on library functions to requesting a structured implementation plan and finally initiating the implementation phase.
2. Technical Foundation:
  • lance.dataset: Function for accessing datasets in the Lance library.
  • lance.connect: Function to establish a connection to a Lance database.
  • table.take_offsets: Method for retrieving offsets from a table, with concerns about index type support.
  • HyraxDataset: Dataset class under evaluation for compatibility with Torch tensors.
3. Codebase Status:
  • File Name: /Users/derekjones/code/work/hyrax/specs/lance_db_spec.md
  • Purpose: This file is important for documenting tests and specifications related to the Lance library and HyraxDataset.
  • Current State: The user is in the process of implementing a design document.
  • Key Code Segments: Not specified in the conversation.
  • Dependencies: Related to the Lance library and HyraxDataset functionalities.
4. Problem Resolution:
  • Issues Encountered: User is uncertain about the compatibility of index types and return types in the context of Hyr...

Created from VS Code.



gitosaurus and others added 18 commits February 11, 2026 17:03
* Use dedicated lance_db subdirectory for LanceDB storage

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
* Update lance_design.md terminology to match actual codebase

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Simplify writer factory to always use ResultDatasetWriter without config option

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

---------

Co-authored-by: Derek T. Jones <dtj@mac.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: Derek T. Jones <dtj1s@uw.edu>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Address feedback on lines 104-106: clarify that tensor metadata (shape, dtype) is stored in Arrow table schema's custom metadata dictionary, show proper JSON serialization using json.dumps(), and explain the serialization/deserialization process with concrete code examples.
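
The serialization round trip that this commit message describes can be sketched as below. Arrow schema metadata is a mapping of byte keys to byte values, so shape and dtype are JSON-encoded with `json.dumps()` before storage. The key names `tensor_shape` and `tensor_dtype` and both helper functions are assumptions for illustration, not names taken from the PR.

```python
import json
from typing import Dict, Tuple


def encode_tensor_metadata(shape: Tuple[int, ...], dtype: str) -> Dict[bytes, bytes]:
    """Serialize shape and dtype into byte key/value pairs."""
    return {
        b"tensor_shape": json.dumps(list(shape)).encode("utf-8"),
        b"tensor_dtype": json.dumps(dtype).encode("utf-8"),
    }


def decode_tensor_metadata(metadata: Dict[bytes, bytes]) -> Tuple[Tuple[int, ...], str]:
    """Recover the (shape, dtype) pair written by encode_tensor_metadata."""
    shape = tuple(json.loads(metadata[b"tensor_shape"].decode("utf-8")))
    dtype = json.loads(metadata[b"tensor_dtype"].decode("utf-8"))
    return shape, dtype


metadata = encode_tensor_metadata((3, 32, 32), "float32")
shape, dtype = decode_tensor_metadata(metadata)
```

With PyArrow installed, a dictionary like this could be attached to a schema through `schema.with_metadata(metadata)` and read back from `schema.metadata`; the sketch leaves that step out so it runs on the standard library alone.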

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

---------

Co-authored-by: Derek T. Jones <dtj@mac.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: Derek T. Jones <dtj1s@uw.edu>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI and others added 6 commits February 12, 2026 02:27
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Start implementation of design document" to "Implement Lance columnar storage for inference results" Feb 12, 2026
Copilot AI requested a review from gitosaurus February 12, 2026 02:42
Copilot AI and others added 2 commits February 12, 2026 16:29
- Simplify Path() conversions in result_dataset.py
- Make single-line if statement for return
- Add comment explaining [0] indexing
- Add tensor shape/dtype validation in subsequent batches
- Fix Windows-incompatible directory names in tests
- Fix .numpy() call in umap.py for ResultDataset compatibility
- Update lookup and save_to_database verbs to use load_results_dataset()
- Add commit() method to InferenceDataSetWriter for API compatibility
- Fix misleading docstring in create_results_writer()
- Replace awkward __new__ pattern with standalone helper function
- Update documentation to match actual implementation (get_data vs get_tensor)
- Re-add pylance dependency (required by lancedb for dataset access)

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

github-actions bot commented Feb 12, 2026

| Before [0aebbf0] | After [c36d109] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 38.5±0.09s | 54.7±0.2s | ~1.42 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'qdrant') |
| 457±6ms | 3.41±0.01s | 7.46 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(64, 'chromadb') |
| 560±1ms | 3.69±0s | 6.59 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(256, 'chromadb') |
| 1.53±0s | 6.11±0.01s | 4.00 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'chromadb') |
| 9.60±0.06s | 25.7±0.1s | 2.67 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'chromadb') |
| 2.18±0.02s | 5.01±0s | 2.30 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(64, 'qdrant') |
| 2.87±0.03s | 5.94±0s | 2.07 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(256, 'qdrant') |
| 6.86±0.01s | 11.4±0s | 1.66 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'qdrant') |
| 1.27G | 1.47G | 1.16 | vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(2048, 'qdrant') |
| 1.03G | 1.19G | 1.15 | vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(2048, 'chromadb') |


gitosaurus and others added 3 commits February 12, 2026 11:37
This reverts commit 0e361fe.

Copilot knocked over the kettle.
* Fix test_nan.py to expect numpy arrays from inference results
* Fix umap verb to return numpy arrays and result dataset
* Fix test_test.py to expect ResultDataset instead of InferenceDataSet

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 88.54167% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.17%. Comparing base (0aebbf0) to head (f4fe719).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/hyrax/config_utils.py | 63.15% | 7 Missing ⚠️ |
| src/hyrax/verbs/lookup.py | 0.00% | 5 Missing ⚠️ |
| src/hyrax/data_sets/result_dataset.py | 96.29% | 4 Missing ⚠️ |
| src/hyrax/verbs/engine.py | 0.00% | 3 Missing ⚠️ |
| src/hyrax/data_sets/result_factories.py | 90.00% | 2 Missing ⚠️ |
| src/hyrax/data_sets/inference_dataset.py | 75.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #699      +/-   ##
==========================================
+ Coverage   63.25%   64.17%   +0.91%     
==========================================
  Files          59       61       +2     
  Lines        5770     5892     +122     
==========================================
+ Hits         3650     3781     +131     
+ Misses       2120     2111       -9     
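
As a quick check on the figures in the diff above, project coverage is hits divided by lines. The sketch below assumes the report truncates to two decimals rather than rounding, which is what the numbers suggest; `coverage_percent` is a hypothetical helper.

```python
import math


def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    return math.floor(hits / lines * 10000) / 100


before = coverage_percent(3650, 5770)  # base (main)
after = coverage_percent(3781, 5892)   # head (this PR)

# Misses are simply the lines that were not hit.
misses_before = 5770 - 3650
misses_after = 5892 - 3781
```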


Collaborator

@drewoldag drewoldag left a comment

Overall this is looking good to me, I'm ok clicking the approve button. I'll plan on working with it a bit tomorrow morning.

- Add resolve_results_dir() to config_utils.py as a shared utility
- Remove duplicate _resolve_results_dir() from result_factories.py
- Remove duplicate _resolve_results_dir() from InferenceDataSet
- Both result_factories and InferenceDataSet now use the shared function
- Removes code duplication and centralizes results directory resolution logic

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
@gitosaurus
Contributor

@copilot The benchmark performance regressions appear to arise from .infer, which suggests that Lance is not fast enough in the current implementation. Use the recommended async/await approach for best speed.

- Implement async Lance operations using connect_async and AsyncTable
- Keep synchronous API for backward compatibility (uses asyncio.run internally)
- Add use_async parameter (default True) to enable/disable async mode
- Remove automatic optimization from commit() (now opt-in via optimize parameter)
- Optimization is expensive and not needed during writes per Lance docs
- Async operations provide better performance for write-heavy workloads
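
The pattern in the list above, a synchronous API that runs async operations through `asyncio.run()`, can be sketched like this. `FakeAsyncTable` is an in-memory stand-in and does not reproduce the LanceDB API; `SketchWriter` is a hypothetical class whose `use_async` and `optimize` parameters mirror the names in the commit message.

```python
import asyncio
from typing import List, Sequence


class FakeAsyncTable:
    """In-memory stand-in for an async table; not the LanceDB API."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    async def add(self, rows: Sequence[dict]) -> None:
        await asyncio.sleep(0)  # yield to the event loop as real I/O would
        self.rows.extend(rows)


class SketchWriter:
    """Synchronous facade over async writes (hypothetical class)."""

    def __init__(self, table: FakeAsyncTable, use_async: bool = True) -> None:
        self._table = table
        self._use_async = use_async
        self.optimized = False

    def write_batch(self, object_ids: Sequence[str], tensors: Sequence[list]) -> None:
        rows = [{"object_id": oid, "data": t} for oid, t in zip(object_ids, tensors)]
        if self._use_async:
            # Callers stay synchronous; the coroutine runs to completion here.
            asyncio.run(self._table.add(rows))
        else:
            self._table.rows.extend(rows)

    def commit(self, optimize: bool = False) -> None:
        # Optimization is opt-in; this sketch only records the request.
        self.optimized = optimize


table = FakeAsyncTable()
writer = SketchWriter(table)
writer.write_batch(["a", "b"], [[0.1, 0.2], [0.3, 0.4]])
writer.commit()
```

One caveat with this design: `asyncio.run()` raises `RuntimeError` when the calling thread already has a running event loop, so a wrapper like this needs a fallback path in such environments.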

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 14, 2026

@copilot The benchmark performance regressions appear to arise from .infer, which suggest that Lance is not fast enough in the current implementation. Use the recommended async/await approach for b...

Implemented async/await support in 5842f93.

The ResultDatasetWriter now uses LanceDB's async operations (connect_async and AsyncTable.add) internally while maintaining a synchronous API for backward compatibility via asyncio.run().

Key changes:

  • Added use_async parameter (default True) to enable async Lance operations
  • Removed automatic table optimization from commit() - now opt-in via optimize parameter
  • Per Lance documentation, optimization is expensive and typically only needed after 100k+ records or 20+ write operations
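
A caller deciding when to pass `optimize` could apply the rule of thumb quoted above. The helper below is a hypothetical sketch; its thresholds are taken from the comment (100k+ records or 20+ write operations), not from a verified reading of the Lance documentation.

```python
def should_optimize(
    total_rows: int,
    write_ops: int,
    row_threshold: int = 100_000,
    op_threshold: int = 20,
) -> bool:
    """Return True once either the row count or the write count crosses its threshold."""
    return total_rows >= row_threshold or write_ops >= op_threshold


small_run = should_optimize(total_rows=5_000, write_ops=4)
large_run = should_optimize(total_rows=150_000, write_ops=3)
many_writes = should_optimize(total_rows=5_000, write_ops=25)
```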

All 16 unit tests pass with the new async implementation.

Copilot AI changed the title from "Implement Lance columnar storage for inference results" to "Implement Lance columnar storage for inference results with async/await performance optimization" Feb 14, 2026
@gitosaurus gitosaurus changed the title from "Implement Lance columnar storage for inference results with async/await performance optimization" to "Implement Lance columnar storage for inference results" Feb 14, 2026
@gitosaurus gitosaurus enabled auto-merge (squash) February 14, 2026 01:20
@gitosaurus gitosaurus merged commit 2918dd6 into main Feb 14, 2026
7 checks passed
@gitosaurus gitosaurus deleted the copilot/vscode-mliu46ak-b0ag branch February 14, 2026 01:23
3 participants