feat: replace pdfjs with tabula for better pdf table extraction #41
Description
This PR replaces the PDF.js-based PDF parsing with Tabula (a Java-based table extraction library) to improve the accuracy and reliability of M-PESA statement parsing. Tabula recognizes table structure better and handles the various statement formats more consistently.
Major Changes
tabulaService.ts - New service that uses Tabula for table extraction via the Rust backend
Extracts tables directly to CSV format, preserving structure better than text extraction
Improved header detection: parser now finds the header row first and ignores all preceding content
Better handling of quoted fields and complex CSV data
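For reviewers, the header-first parsing and quoted-field handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names and the `EXPECTED_HEADERS` column names are assumptions made for the example, and the sketch does not handle quoted fields that span multiple lines.

```typescript
// Hypothetical column names used to recognise the header row;
// the real statement headers may differ.
const EXPECTED_HEADERS = ["Receipt No", "Completion Time", "Details"];

// Split one CSV line into fields, honouring double-quoted fields
// and the "" escape for a literal quote inside a quoted field.
function splitCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (line.charAt(i + 1) === '"') {
          current += '"'; // escaped quote
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Return the index of the first row containing every expected header, or -1.
function findHeaderIndex(rows: string[][]): number {
  return rows.findIndex((row) => {
    const cells = row.map((c) => c.trim().toLowerCase());
    return EXPECTED_HEADERS.every((h) => cells.includes(h.toLowerCase()));
  });
}

// Parse CSV text into records, ignoring everything before the header row.
function parseStatementCsv(csv: string): Record<string, string>[] {
  const rows = csv
    .split(/\r?\n/)
    .filter((l) => l.trim() !== "")
    .map(splitCsvLine);
  const headerIndex = findHeaderIndex(rows);
  if (headerIndex === -1) return [];
  const header = (rows[headerIndex] ?? []).map((c) => c.trim());
  return rows.slice(headerIndex + 1).map((row) => {
    const record: Record<string, string> = {};
    header.forEach((name, i) => {
      record[name] = (row[i] ?? "").trim();
    });
    return record;
  });
}
```

Locating the header row first means any preamble that lands in the CSV before the table (for example, customer details or summary lines) is skipped without format-specific rules.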
Testing Recommendations