feat: replace pdfjs with tabula for better pdf table extraction #41

DavidAmunga · 2025-10-12T14:31:29Z

Description

This PR replaces the PDF.js-based PDF parsing with Tabula (a Java-based table extraction library) to significantly improve the accuracy and reliability of M-PESA statement parsing. Tabula recognizes table structure better and handles the various statement formats more consistently.

Major Changes

New Tabula-based PDF Processing

Added tabulaService.ts - New service that uses Tabula for table extraction via the Rust backend
Extracts tables directly to CSV format, preserving structure better than text extraction
Improved header detection: parser now finds the header row first and ignores all preceding content
Better handling of quoted fields and complex CSV data
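
For reviewers, the header-first parsing described above can be sketched roughly as follows. This is a minimal illustration and not the code in tabulaService.ts: the helper names (`parseCsvLine`, `rowsAfterHeader`) and the header labels passed in are assumptions made for the example. It splits each CSV line while honouring quoted fields, finds the first row that contains every expected header label, and discards everything before it.

```typescript
// Hypothetical helpers for illustration; not taken from tabulaService.ts.

/** Split one CSV line into fields, honouring double-quoted fields and "" escapes. */
function parseCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // doubled quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields.map((f) => f.trim());
}

/**
 * Return the data rows that follow the first row containing every expected
 * header label. Rows before the header (titles, summaries) are ignored.
 * Limitation of this sketch: quoted fields with embedded newlines are not handled.
 */
function rowsAfterHeader(csv: string, expected: string[]): Record<string, string>[] {
  const rows = csv
    .split(/\r?\n/)
    .filter((l) => l.trim() !== "")
    .map(parseCsvLine);
  const wanted = expected.map((h) => h.toLowerCase());
  const headerIndex = rows.findIndex((row) => {
    const cells = row.map((c) => c.toLowerCase());
    return wanted.every((h) => cells.includes(h));
  });
  if (headerIndex === -1) return []; // no header row found
  const header = rows[headerIndex];
  return rows.slice(headerIndex + 1).map((row) => {
    const record: Record<string, string> = {};
    header.forEach((name, i) => {
      record[name] = row[i] ?? "";
    });
    return record;
  });
}
```

Matching on header labels rather than a fixed row offset is one way to stay robust when the number of lines before the table varies between statements.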

Removed PDF.js Dependencies

Removed pdfService.ts (524 lines) - Old PDF.js-based parser
Removed pdfjs-setup.ts - PDF.js worker configuration
Removed pdfjs-dist npm dependency
Simpler codebase with fewer dependencies

Enhanced Paybill Statement Support

Added support for Paybill-specific fields:
transactionType - Type of transaction
otherParty - The other party involved in the transaction
Updated CSV export to include paybill fields when present
Updated XLSX export to dynamically add paybill columns when needed
Automatic detection of paybill vs personal statements
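
A rough sketch of how the conditional export columns could work is shown below. Only `transactionType` and `otherParty` come from this PR; the other `Transaction` fields, the column labels, and the function names are hypothetical. The PR does not state how detection is implemented, so treating a statement as paybill when any row carries one of the two fields is an assumption of this example.

```typescript
// Illustrative only; field names other than transactionType and otherParty are assumed.
interface Transaction {
  receiptNo: string; // hypothetical base field
  completionTime: string; // hypothetical base field
  details: string; // hypothetical base field
  amount: number; // hypothetical base field
  transactionType?: string; // paybill-specific (from this PR)
  otherParty?: string; // paybill-specific (from this PR)
}

/** Assumed heuristic: a statement is paybill if any row has a paybill field. */
function isPaybillStatement(txs: Transaction[]): boolean {
  return txs.some(
    (t) => t.transactionType !== undefined || t.otherParty !== undefined
  );
}

/** Quote a value only when it contains a comma, quote, or line break. */
function escapeCsv(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

/** Build CSV text, appending the paybill columns only when they are needed. */
function toCsv(txs: Transaction[]): string {
  const paybill = isPaybillStatement(txs);
  const header = ["Receipt No", "Completion Time", "Details", "Amount"];
  if (paybill) header.push("Transaction Type", "Other Party");
  const lines = txs.map((t) => {
    const cells = [t.receiptNo, t.completionTime, t.details, String(t.amount)];
    if (paybill) cells.push(t.transactionType ?? "", t.otherParty ?? "");
    return cells.map(escapeCsv).join(",");
  });
  return [header.join(","), ...lines].join("\n");
}
```

Personal statements keep the same column layout as before, and the extra columns appear only for paybill data. The same check could drive the dynamic XLSX columns.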

Testing Recommendations

Test with personal M-PESA statements
Test with paybill M-PESA statements
Test with password-protected PDFs
Test with various statement formats and date ranges
Verify CSV and XLSX exports include all relevant fields

DavidAmunga added 6 commits October 12, 2025 00:35

feat: base setup

f0c0641

fix: jre setup

dfccd10

chore: remove pdfjs-dist

dea1d66

fix: parse from header row

313134d

docs: updated changeset

97d245d

chore: enhance JRE setup for Tabula and update system dependencies

84e14df

DavidAmunga self-assigned this Oct 12, 2025

DavidAmunga changed the title ~~feat: Replace PDF.js with Tabula for improved PDF table extraction~~ feat: replace pdfjs with tabula for better pdf table extraction Oct 12, 2025

DavidAmunga added 4 commits October 12, 2025 23:08

chore: standardize working directory for CI workflows and enhance caching

5cea806

chore: remove Android build steps from main release workflow

0fdfec7

fix: revert working dir in actions

1065418

fix: pr check yml

14dc98a

DavidAmunga merged commit f7f0162 into main Oct 13, 2025
6 checks passed

DavidAmunga deleted the feat/add-tabula branch October 13, 2025 13:20

2 participants

feat: replace pdfjs with tabula for better pdf table extraction #41

feat: replace pdfjs with tabula for better pdf table extraction #41

Uh oh!

Conversation

DavidAmunga commented Oct 12, 2025

Description

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants