Implement Lance columnar storage for inference results#699
Conversation
* Use dedicated lance_db subdirectory for LanceDB storage --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
* Update lance_design.md terminology to match actual codebase Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Simplify writer factory to always use ResultDatasetWriter without config option Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> --------- Co-authored-by: Derek T. Jones <dtj@mac.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> Co-authored-by: Derek T. Jones <dtj1s@uw.edu> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Address feedback on lines 104-106: clarify that tensor metadata (shape, dtype) is stored in Arrow table schema's custom metadata dictionary, show proper JSON serialization using json.dumps(), and explain the serialization/deserialization process with concrete code examples. Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> --------- Co-authored-by: Derek T. Jones <dtj@mac.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> Co-authored-by: Derek T. Jones <dtj1s@uw.edu> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
- Simplify Path() conversions in result_dataset.py - Make single-line if statement for return - Add comment explaining [0] indexing - Add tensor shape/dtype validation in subsequent batches - Fix Windows-incompatible directory names in tests - Fix .numpy() call in umap.py for ResultDataset compatibility - Update lookup and save_to_database verbs to use load_results_dataset() - Add commit() method to InferenceDataSetWriter for API compatibility - Fix misleading docstring in create_results_writer() - Replace awkward __new__ pattern with standalone helper function - Update documentation to match actual implementation (get_data vs get_tensor) - Re-add pylance dependency (required by lancedb for dataset access) Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Click here to view all benchmarks. |
This reverts commit 0e361fe. Copilot knocked over the kettle.
* Fix test_nan.py to expect numpy arrays from inference results * Fix umap verb to return numpy arrays and result dataset * Fix test_test.py to expect ResultDataset instead of InferenceDataSet --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #699 +/- ##
==========================================
+ Coverage 63.25% 64.17% +0.91%
==========================================
Files 59 61 +2
Lines 5770 5892 +122
==========================================
+ Hits 3650 3781 +131
+ Misses 2120 2111 -9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
drewoldag
left a comment
There was a problem hiding this comment.
Overall this is looking good to me, I'm ok clicking the approve button. I'll plan on working with it a bit tomorrow morning.
- Add resolve_results_dir() to config_utils.py as a shared utility - Remove duplicate _resolve_results_dir() from result_factories.py - Remove duplicate _resolve_results_dir() from InferenceDataSet - Both result_factories and InferenceDataSet now use the shared function - Removes code duplication and centralizes results directory resolution logic Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
|
@copilot The benchmark performance regressions appear to arise from |
- Implement async Lance operations using connect_async and AsyncTable - Keep synchronous API for backward compatibility (uses asyncio.run internally) - Add use_async parameter (default True) to enable/disable async mode - Remove automatic optimization from commit() (now opt-in via optimize parameter) - Optimization is expensive and not needed during writes per Lance docs - Async operations provide better performance for write-heavy workloads Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Implemented async/await support in 5842f93. The Key changes:
All 16 unit tests pass with the new async implementation. |
…erformance" This reverts commit 5842f93.
Change Description
Replaces batched
.npyfile storage with Lance columnar format for inference results (infer,test,umap,engineverbs). New writes default to Lance for optimal performance; existing.npyfiles remain readable via auto-detection.Solution Description
New classes (
src/hyrax/data_sets/result_dataset.py):ResultDatasetWriter: Includes tensor metadata in PyArrow schema and validation for tensor shape/dtype consistency across batches.ResultDataset: PyTorch Dataset + HyraxQL getters (get_data,get_object_id)Factory functions (
src/hyrax/data_sets/result_factories.py):create_results_writer(): Returns Lance writerload_results_dataset(): Auto-detects Lance vs.npyon disk, returns appropriate readerShared utilities (
src/hyrax/config_utils.py):resolve_results_dir(): Centralized results directory resolution logic, now used by bothInferenceDataSetandresult_factories.py(eliminates code duplication)Updated verbs:
infer,test,umap,engine: Use factories, callwriter.commit()instead ofwrite_index()lookup,save_to_database: Updated to useload_results_dataset()for format-agnostic readingpytorch_ignite.create_save_batch_callback(): Uses factoryInferenceDataSetWriter: Addedcommit()method for API compatibilityDependencies: Added
lancedb,pyarrow,pylancetopyproject.tomlExample usage:
Testing: 16 unit tests covering batch writes, multi-dim tensors, dtypes, edge cases (NaN/Inf), indexing, iteration, cross-platform compatibility, async/sync modes.
Documentation: Updated
lance_design_hyrax.mdto reflect actual implementation (Lance-only writes, auto-detection on read, correct getter names).Future work: Migration script for
.npy→ Lance conversion,visualizeverb support, deprecation ofInferenceDataSet.Code Quality
Original prompt
Start implementation
The user has attached the following file paths as relevant context:
- .github/copilot-instructions.md
[Chronological Review: The conversation began with the user asking about the differences between `lance.dataset` and the combination of `lance.connect` and `.open_table`. This was followed by inquiries regarding the support of index types in `table.take_offsets` and the return type of `HyraxDataset`. The user then requested a plan for implementing a design document in a tractable and testable manner, and finally initiated the implementation phase.][Intent Mapping:
lance.datasetand the combination oflance.connectand.open_table? Is one preferred?" - User seeks clarification on the usage of Lance library functions.table.take_offsetssupport all of these index types?" - User wants to confirm compatibility of index types withHyraxDataset.HyraxDataset.[Technical Inventory:
lance.dataset: A function for accessing datasets in the Lance library.lance.connect: A function to establish a connection to a Lance database.table.take_offsets: A method for retrieving offsets from a table, with concerns about index type support.HyraxDataset: A dataset class that the user is evaluating for compatibility with Torch tensors.][Code Archaeology:
/Users/derekjones/code/work/hyrax/specs/lance_db_spec.md. The focus is on understanding and implementing features related to the Lance library and HyraxDataset.][Progress Assessment:
[Context Validation: All necessary context for continuing the work on the Lance library and HyraxDataset is captured, including user inquiries and the current file being worked on.]
[Recent Commands Analysis:
1. Conversation Overview: - Primary Objectives: 1. "Is there a difference between using `lance.dataset` and the combination of `lance.connect` and `.open_table`? Is one preferred?" 2. "Does `table.take_offsets` support all of these index types?" 3. "What about returning Torch.tensor? Is that promised by HyraxDataset?" 4. "Take a look at this design document and develop a plan for implementing it..." 5. "Start implementation" - Session Context: The conversation has revolved around understanding the Lance library's functionalities and preparing for the implementation of a design document related to HyraxDataset. - User Intent Evolution: The user transitioned from seeking clarifications on library functions to requesting a structured implementation plan and finally initiating the implementation phase.- Technical Foundation:
- Codebase Status:
- File Name:
- Purpose: This file is important for documenting tests and specifications related to the Lance library and HyraxDataset.
- Current State: The user is in the process of implementing a design document.
- Key Code Segments: Not specified in the conversation.
- Dependencies: Related to the Lance library and HyraxDataset functionalities.
- Problem Resolution:
- Issues Encountered: User is uncertain about the compatibility of index types and return types in the context of Hyr...
lance.dataset: Function for accessing datasets in the Lance library.lance.connect: Function to establish a connection to a Lance database.table.take_offsets: Method for retrieving offsets from a table, with concerns about index type support.HyraxDataset: Dataset class under evaluation for compatibility with Torch tensors./Users/derekjones/code/work/hyrax/specs/lance_db_spec.mdCreated from VS Code.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.