Open
Description
Priority: Medium
Extend document processing to support more file formats beyond PDF:
Current Support:
- PDF files only
- Basic text extraction with PyPDF2 and OCR
Requested Formats:
- DOCX (Microsoft Word documents)
- XLSX (Microsoft Excel spreadsheets)
- TXT (Plain text files)
- RTF (Rich Text Format)
- HTML (Web pages)
Expected Outcome:
- Unified processing pipeline for multiple formats
- Consistent categorization across all supported formats
- Text extraction tuned to each format
Technical Approach:
- Add python-docx for DOCX support
- Add openpyxl for XLSX support
- Implement format detection based on file extension
- Create format-specific text extraction methods
- Update API to accept multiple file types
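
A minimal sketch of how the extension-based detection and per-format extractors listed above could fit together. All function names here (`detect_format`, `extract_text`, `extract_docx`, and so on) and the `EXTRACTORS` table are hypothetical and not taken from the existing codebase. The sketch assumes python-docx and openpyxl are installed for the DOCX and XLSX paths; they are imported lazily so the TXT and HTML paths work without them. The existing PDF path (PyPDF2 and OCR) is not shown, and RTF is omitted because no library for it has been chosen yet.

```python
from html.parser import HTMLParser
from pathlib import Path


class _HTMLTextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_txt(path):
    # Assumes UTF-8; undecodable bytes are replaced rather than raising.
    return Path(path).read_text(encoding="utf-8", errors="replace")


def extract_html(path):
    collector = _HTMLTextCollector()
    collector.feed(Path(path).read_text(encoding="utf-8", errors="replace"))
    collector.close()
    return "\n".join(collector.parts)


def extract_docx(path):
    from docx import Document  # python-docx, imported lazily

    # Body paragraphs only; text inside tables is not included here.
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def extract_xlsx(path):
    from openpyxl import load_workbook  # imported lazily

    # data_only=True reads cached cell values instead of formulas.
    workbook = load_workbook(path, read_only=True, data_only=True)
    lines = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            cells = [str(value) for value in row if value is not None]
            if cells:
                lines.append("\t".join(cells))
    workbook.close()
    return "\n".join(lines)


# Hypothetical dispatch table: lowercase file extension -> extractor.
EXTRACTORS = {
    ".txt": extract_txt,
    ".html": extract_html,
    ".htm": extract_html,
    ".docx": extract_docx,
    ".xlsx": extract_xlsx,
}


def detect_format(filename):
    """Return the lowercase extension, e.g. '.docx' for 'Report.DOCX'."""
    return Path(filename).suffix.lower()


def extract_text(path):
    """Route a file to the extractor registered for its extension."""
    suffix = detect_format(path)
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return extractor(path)
```

With this shape, supporting another format means adding one function and one table entry, and the API layer can reject uploads early by checking `detect_format(name) in EXTRACTORS`. Extension-based detection trusts the file name, so a content check (for example, on magic bytes) may still be worth adding for untrusted uploads.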