feat: Add CLI tools for ORC file inspection and manipulation #73
base: main
Conversation
Pull request overview
This PR adds five new CLI tools for inspecting and manipulating ORC files: orc-read (stream data as CSV/JSON), orc-schema (display metadata and schema), orc-rowcount (report row counts), orc-index (inspect row group statistics), and orc-layout (emit physical layout as JSON). To support these tools, the proto module is made public to expose protobuf types, and serde/serde_json dependencies are added to the cli feature.
Key Changes
- Added five new CLI binaries with corresponding Cargo.toml bin entries
- Made `proto` module public to enable CLI tools to access low-level protobuf structures
- Added `serde` and `serde_json` as optional dependencies under the `cli` feature
- Created integration tests in `tests/bin/main.rs` to verify basic CLI functionality
Reviewed changes
Copilot reviewed 2 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| Cargo.toml | Adds serde/serde_json to cli feature dependencies; registers 5 new binaries |
| src/lib.rs | Changes proto module from private to public |
| src/bin/orc-read.rs | New CLI tool to stream ORC data as CSV or JSON lines with stdin support |
| src/bin/orc-schema.rs | New CLI tool to print file metadata and schema with optional verbose mode |
| src/bin/orc-rowcount.rs | New CLI tool to report total row counts for one or more files |
| src/bin/orc-index.rs | New CLI tool to inspect row group statistics for a specific column |
| src/bin/orc-layout.rs | New CLI tool to emit JSON description of stripe physical layout |
| tests/bin/main.rs | Smoke tests for all new CLI binaries, gated behind cli feature |
Nit: Should we include an AI-generated README.md to demonstrate how to use the CLI?
Good idea.
|
Consider add |
ca9ea94 to
26d842a
Compare
|
It seems excessive to test the exact text of the help commands. This is going to break whenever we change an option or |
progval
left a comment
Some comments below, but I didn't fully review
Yeah, checking the help output might be a bit too cumbersome; I will remove these cases.
0a9bc23 to
5fc7396
Compare
This commit merges multiple ORC CLI commands into a single command structure, enhancing usability and maintainability. The previous commands for metadata inspection, data export, and statistics have been integrated into a cohesive CLI tool with subcommands for various functionalities, including `info`, `export`, `stats`, `layout`, and `index`. Additionally, the `orc` binary has been streamlined to facilitate easier command execution.
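To illustrate the single-binary structure described in this commit message, here is a minimal std-only sketch of subcommand dispatch. It is not the PR's implementation, which may well use an argument-parsing crate; the `Command` enum and the `parse_command` and `dispatch` functions are hypothetical names introduced only for this example.

```rust
/// Subcommands named in the commit message (hypothetical enum for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Command {
    Info,
    Export,
    Stats,
    Layout,
    Index,
}

/// Map a subcommand name to its variant; `None` means the name is unknown.
pub fn parse_command(name: &str) -> Option<Command> {
    match name {
        "info" => Some(Command::Info),
        "export" => Some(Command::Export),
        "stats" => Some(Command::Stats),
        "layout" => Some(Command::Layout),
        "index" => Some(Command::Index),
        _ => None,
    }
}

/// Split `args` (program name already removed) into a subcommand
/// and the arguments that belong to it.
pub fn dispatch(args: &[String]) -> Result<(Command, &[String]), String> {
    let (first, rest) = args
        .split_first()
        .ok_or_else(|| "missing subcommand".to_string())?;
    let cmd = parse_command(first)
        .ok_or_else(|| format!("unknown subcommand: {first}"))?;
    Ok((cmd, rest))
}
```

A caller would pass `std::env::args().skip(1).collect::<Vec<_>>()` to `dispatch` and then match on the returned `Command`, so each subcommand receives only its own arguments.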
This commit expands the testing suite for the unified `orc` CLI binary, adding comprehensive tests for various subcommands including `info`, `export`, `stats`, `layout`, and `index`. It introduces helper functions for managing test data paths and expected output comparisons, ensuring that actual command outputs are validated against predefined expected results. Additionally, new expected output files have been created to support these tests, improving the robustness of the CLI tool's testing framework.
5fc7396 to
1c41f29
Compare
|
@progval I rebased the bloom filter fix, and now |
ORC CLI Tool
A unified command-line tool for inspecting and exporting Apache ORC files.
Installation
Build with the `cli` feature enabled:
The binary will be available at `target/release/orc`.
Usage
Commands
`info` - Display file metadata and schema
Display basic information about an ORC file, including format version, compression, row count, and schema.
Options:
- `-v, --verbose` - Include stripe layout details (offsets, lengths, rows)
- `--row-count-only` - Only display the row count for each file
`export` - Export data to CSV or JSON
Export ORC data to CSV or JSON format, with optional row limiting and column selection.
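The row limiting and column selection mentioned here can be sketched as follows. This is an illustrative, std-only example operating on in-memory strings, not the tool's actual code: the function names are hypothetical, and details such as whitespace trimming and preserving the requested column order are assumptions. The "0 means all rows" rule follows the `-n, --num-rows` option described for this command.

```rust
/// Parse a `--columns` style value such as "id, name" into trimmed,
/// non-empty column names (trimming is an assumption of this sketch).
pub fn parse_columns(spec: &str) -> Vec<String> {
    spec.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Apply a `--num-rows` style limit, where 0 means "no limit".
pub fn limit_rows<T>(rows: Vec<T>, num_rows: usize) -> Vec<T> {
    if num_rows == 0 {
        rows
    } else {
        rows.into_iter().take(num_rows).collect()
    }
}

/// Keep only the selected columns of one row, in the order requested.
/// `header` and `row` are hypothetical stand-ins for decoded ORC data;
/// unknown column names are silently skipped in this sketch.
pub fn project_row(header: &[&str], row: &[String], columns: &[String]) -> Vec<String> {
    columns
        .iter()
        .filter_map(|c| header.iter().position(|h| *h == c.as_str()))
        .filter_map(|i| row.get(i).cloned())
        .collect()
}
```

A real exporter would more likely push the column selection down into the ORC reader so that unselected columns are never decoded; the sketch only shows the option semantics.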
Options:
- `-f, --format <FORMAT>` - Output format: `csv` (default) or `json`
- `-o, --output <FILE>` - Output file (default: stdout)
- `-n, --num-rows <N>` - Export only the first N records (0 = all)
- `-c, --columns <COLS>` - Comma-separated list of columns to export
- `--batch-size <SIZE>` - Batch size for reading (default: 8192)
`stats` - Print column and stripe statistics
Display detailed statistics for each column and stripe, including min/max values, null counts, and type-specific stats.
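As a rough illustration of what min/max values and null counts mean for a single integer column, the sketch below computes them from in-memory values. In an ORC file these statistics are stored by the writer and the tool only reads them back; the `IntStats` type and `int_stats` function are hypothetical and exist only for this example.

```rust
/// Summary of one integer column (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq, Default)]
pub struct IntStats {
    /// Number of non-null values.
    pub values: u64,
    /// Number of null values.
    pub nulls: u64,
    /// Smallest non-null value, if any.
    pub min: Option<i64>,
    /// Largest non-null value, if any.
    pub max: Option<i64>,
}

/// Compute value count, null count, min, and max in one pass.
pub fn int_stats(column: &[Option<i64>]) -> IntStats {
    let mut stats = IntStats::default();
    for v in column {
        match v {
            Some(x) => {
                stats.values += 1;
                stats.min = Some(stats.min.map_or(*x, |m| m.min(*x)));
                stats.max = Some(stats.max.map_or(*x, |m| m.max(*x)));
            }
            None => stats.nulls += 1,
        }
    }
    stats
}
```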
Output includes:
`layout` - Print physical layout as JSON
Output a JSON representation of the file's physical layout, useful for debugging and analysis.
JSON structure:
{
  "file": "path/to/file.orc",
  "format_version": "0.12",
  "compression": "ZLIB",
  "rows": 1000000,
  "stripes": [
    {
      "index": 0,
      "offset": 3,
      "index_length": 550,
      "data_length": 12345,
      "footer_length": 100,
      "rows": 10000,
      "streams": [...],
      "encodings": [...]
    }
  ]
}
`index` - Print row group index information
Inspect row indexes for a specific column, useful for debugging predicate pushdown and verifying writer-produced indexes.
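To make the link between row indexes and predicate pushdown concrete, here is a minimal sketch of how per-row-group min/max statistics let a reader skip row groups for an equality predicate. The `RowGroupStats` type and `candidate_row_groups` function are hypothetical simplifications; real ORC row indexes carry typed statistics and stream positions as well.

```rust
/// Min/max of one row group for an integer column
/// (hypothetical, simplified type for illustration).
#[derive(Debug, Clone, Copy)]
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
}

/// Return the indexes of row groups whose [min, max] range could
/// contain `value`; all other row groups can be skipped entirely.
pub fn candidate_row_groups(groups: &[RowGroupStats], value: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| g.min <= value && value <= g.max)
        .map(|(i, _)| i)
        .collect()
}
```

If the index printed for a column shows wide, overlapping ranges across row groups, this kind of pruning skips very little, which is the sort of problem the command is meant to help diagnose.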
Output includes:
`bloom` - Inspect bloom filters
Inspect bloom filters in ORC files. Bloom filters are probabilistic data structures that can quickly determine if a value is definitely NOT present in a row group, which is useful for predicate pushdown optimization.
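The "definitely NOT present" guarantee can be seen in a small generic bloom filter. This is a sketch using the standard library's `DefaultHasher`, not the hashing scheme or bit layout that ORC files actually use; the `BloomFilter` type and its methods are hypothetical names for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A minimal bloom filter: a bit array plus a fixed number of hash probes.
pub struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    pub fn new(num_bits: usize, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        BloomFilter { bits: vec![false; num_bits], num_hashes }
    }

    /// Bit position for the i-th probe, derived by hashing `i` with the value.
    fn position<T: Hash + ?Sized>(&self, value: &T, i: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        i.hash(&mut hasher);
        value.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    /// Set every probed bit for `value`.
    pub fn insert<T: Hash + ?Sized>(&mut self, value: &T) {
        for i in 0..self.num_hashes {
            let pos = self.position(value, i);
            self.bits[pos] = true;
        }
    }

    /// `false` means the value was definitely never inserted;
    /// `true` means it may have been (false positives are possible).
    pub fn might_contain<T: Hash + ?Sized>(&self, value: &T) -> bool {
        (0..self.num_hashes).all(|i| self.bits[self.position(value, i)])
    }
}
```

Because inserting a value sets all of its probed bits, a lookup that finds any probed bit unset proves the value was never inserted, so a reader can skip that row group. A `true` result only means the row group still has to be read.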
Options:
- `-c, --column <NAME>` - Column name to inspect (show all if not specified)
- `-t, --test <VALUE>` - Test if a value might be contained in the bloom filter
Output includes:
`--test` is used)
Examples
Inspecting a file
Exporting data
Debugging