refactor: consolidate to single memory-efficient scanner implementation #4

jackspirou · 2025-08-06T19:44:35Z

Summary

This PR consolidates the tokenizer to use a single memory-efficient implementation, eliminating code duplication and improving memory usage for all inputs.

Changes

Removed redundant stream command - The encode command now handles all cases efficiently
Simplified Scanner API - Changed from NewScanner() and NewScannerOptions() to a single NewScanner(r io.Reader, opts ...ScannerOption) following Go conventions
Memory-efficient implementation - All encoding now uses O(1) memory instead of O(n), making it suitable for files of any size
Added byte counting for metrics - When metrics are enabled with stdin input, accurately tracks input bytes using a counting reader
Updated documentation - Removed references to the stream command

Breaking Changes

⚠️ Removed stream command (use encode instead - it now handles streaming efficiently)
⚠️ Removed NewScannerOptions method (use NewScanner with options)

Why This Change?

The previous implementation had two separate code paths:

encode: Read entire input into memory (O(n) memory complexity)
stream: Used scanner with bounded memory (O(1) memory complexity)

The scanner implementation is strictly superior - it handles everything the old encode did but with constant memory usage. There was no reason to maintain both implementations.

Testing

✅ All existing tests pass
✅ Tested with large files (>1GB)
✅ Tested with piped input
✅ Tested with command-line args
✅ All output formats work (space, newline, JSON)
✅ Metrics work correctly with both stdin and args
✅ No linting issues

Performance

The new implementation:

Uses constant memory for all inputs
Can handle files larger than available RAM
Maintains 100% tokenization accuracy
Shows same or better performance for small inputs

🤖 Generated with Claude Code

- Remove redundant stream command and consolidate with encode - Simplify Scanner API to single NewScanner(r, opts...) method - Use memory-efficient scanner implementation for all encoding (O(1) memory) - Add byte counting for stdin inputs when metrics are enabled - Update documentation to reflect removal of stream command The scanner-based implementation is strictly superior to the old encode implementation - it handles everything encode did with constant memory usage instead of linear, making it suitable for files of any size. BREAKING CHANGE: Removed stream command (use encode instead) BREAKING CHANGE: Removed NewScannerOptions method (use NewScanner with options) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

codecov-commenter · 2025-08-06T19:47:31Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

- Remove references to stream.go file in CLAUDE.md - Update cmd/tokenizer/README.md to remove stream command examples - Fix NewScannerOptions references in IMPLEMENTATION.md - Regenerate shell completions without stream command These changes reflect the consolidation to a single memory-efficient scanner implementation that was merged in PR #4. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>

jackspirou merged commit e9d530c into master Aug 6, 2025
11 checks passed

jackspirou mentioned this pull request Aug 6, 2025

docs: update documentation after stream command removal #5

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: consolidate to single memory-efficient scanner implementation #4

refactor: consolidate to single memory-efficient scanner implementation #4

Uh oh!

jackspirou commented Aug 6, 2025

Uh oh!

Uh oh!

codecov-commenter commented Aug 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

refactor: consolidate to single memory-efficient scanner implementation #4

refactor: consolidate to single memory-efficient scanner implementation #4

Uh oh!

Conversation

jackspirou commented Aug 6, 2025

Summary

Changes

Breaking Changes

Why This Change?

Testing

Performance

Uh oh!

Uh oh!

codecov-commenter commented Aug 6, 2025

Welcome to Codecov 🎉

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants