@DavidAmunga
Owner

Description

This PR replaces the PDF.js-based PDF parsing with Tabula, a Java-based table extraction library, to improve the accuracy and reliability of M-PESA statement parsing. Tabula recognizes table structure better than plain text extraction and handles the various statement formats more consistently.

Major Changes

  1. New Tabula-based PDF Processing
  • Added tabulaService.ts - New service that uses Tabula for table extraction via Rust backend
    • Extracts tables directly to CSV format, preserving structure better than text extraction
    • Improved header detection: parser now finds the header row first and ignores all preceding content
    • Better handling of quoted fields and complex CSV data
  2. Removed PDF.js Dependencies
  • Removed pdfService.ts (524 lines) - Old PDF.js-based parser
  • Removed pdfjs-setup.ts - PDF.js worker configuration
  • Removed pdfjs-dist npm dependency
  • Simpler codebase with fewer dependencies
  3. Enhanced Paybill Statement Support
  • Added support for Paybill-specific fields:
    • transactionType - Type of transaction
    • otherParty - The other party involved in the transaction
  • Updated CSV export to include paybill fields when present
  • Updated XLSX export to dynamically add paybill columns when needed
  • Automatic detection of paybill vs personal statements
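
For reviewers, the header detection and quoted-field handling described under item 1 can be pictured roughly as follows. This is a minimal sketch and not the code in tabulaService.ts: the function names parseCsvLine and rowsAfterHeader are hypothetical, and the "Receipt No." header cell used in the usage note is an assumption about the statement layout. It also assumes one record per line, so quoted fields containing line breaks are not handled.

```typescript
// Illustrative only: helper names are hypothetical, not taken from tabulaService.ts.

/** Split one CSV line into fields, honoring double-quoted fields and "" escapes. */
function parseCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch; // commas are literal while inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      fields.push(current); // field separator
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

/** Find the header row by one of its cells and return only the rows after it. */
function rowsAfterHeader(csv: string, headerCell: string): string[][] {
  const wanted = headerCell.trim().toLowerCase();
  const rows = csv
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map(parseCsvLine);
  const headerIndex = rows.findIndex((row) =>
    row.some((cell) => cell.trim().toLowerCase() === wanted)
  );
  // Everything before the header (customer details, summaries) is discarded.
  return headerIndex === -1 ? [] : rows.slice(headerIndex + 1);
}
```

Calling rowsAfterHeader(csvText, "Receipt No.") would drop any customer-detail or summary lines that appear before the transaction table, which is the behavior the header-detection bullet describes.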

Testing Recommendations

  • Test with personal M-PESA statements
  • Test with paybill M-PESA statements
  • Test with password-protected PDFs
  • Test with various statement formats and date ranges
  • Verify CSV and XLSX exports include all relevant fields
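
The export check in the last item could be exercised against something like the sketch below. It is an assumption-laden illustration, not the exporter in this PR: the Transaction shape (apart from transactionType and otherParty), the column titles, and the toCsv and csvEscape names are all hypothetical, and it assumes a statement counts as paybill when any transaction carries one of the two paybill fields.

```typescript
// Illustrative only: everything except transactionType/otherParty is a hypothetical shape.
interface Transaction {
  receiptNo: string;
  details: string;
  amount: number;
  transactionType?: string; // paybill-only field
  otherParty?: string; // paybill-only field
}

/** Quote a value when it contains a comma, quote, or line break. */
function csvEscape(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

/** Build CSV text, adding the paybill columns only when some row uses them. */
function toCsv(transactions: Transaction[]): string {
  // Assumed detection rule: any paybill field present means a paybill statement.
  const isPaybill = transactions.some(
    (t) => t.transactionType !== undefined || t.otherParty !== undefined
  );

  const header = ["Receipt No", "Details", "Amount"];
  if (isPaybill) header.push("Transaction Type", "Other Party");

  const lines = transactions.map((t) => {
    const row = [t.receiptNo, t.details, String(t.amount)];
    if (isPaybill) row.push(t.transactionType ?? "", t.otherParty ?? "");
    return row.map(csvEscape).join(",");
  });

  return [header.join(","), ...lines].join("\n");
}
```

With this shape, a personal statement exports three columns and a paybill statement exports five, so a test can assert on the header row alone to confirm the right layout was chosen.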

@DavidAmunga DavidAmunga self-assigned this Oct 12, 2025
@DavidAmunga DavidAmunga changed the title from "feat: Replace PDF.js with Tabula for improved PDF table extraction" to "feat: replace pdfjs with tabula for better pdf table extraction" Oct 12, 2025
@DavidAmunga DavidAmunga merged commit f7f0162 into main Oct 13, 2025
6 checks passed
@DavidAmunga DavidAmunga deleted the feat/add-tabula branch October 13, 2025 13:20