Add Support for Additional Document Formats #10

@Davz33

Description

Priority: Medium

Extend document processing to support file formats other than PDF:

Current Support:

  • PDF files only
  • Basic text extraction with PyPDF2 and OCR

Requested Formats:

  • DOCX (Microsoft Word documents)
  • XLSX (Microsoft Excel spreadsheets)
  • TXT (Plain text files)
  • RTF (Rich Text Format)
  • HTML (Web pages)

Expected Outcome:

  • Unified processing pipeline for multiple formats
  • Consistent categorization across all supported formats
  • Text extraction tuned to each format

Technical Approach:

  • Add python-docx for DOCX support
  • Add openpyxl for XLSX support
  • Implement format detection based on file extension
  • Create format-specific text extraction methods
  • Update API to accept multiple file types
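
A minimal sketch of how the extension-based detection and per-format extraction could fit together. All function and class names here (`extract_text`, `EXTRACTORS`, `UnsupportedFormatError`, and the `extract_*` helpers) are hypothetical and not taken from the existing codebase. The existing PDF path is left out because its code is not shown in this issue, and RTF is left out because no library is proposed for it above. python-docx and openpyxl are imported lazily so the TXT and HTML paths work without them installed.

```python
from html.parser import HTMLParser
from pathlib import Path


class UnsupportedFormatError(ValueError):
    """Raised when a file extension has no registered extractor."""


def extract_txt(path: Path) -> str:
    # Plain text: read as UTF-8 and replace undecodable bytes (an assumption;
    # real inputs may need encoding detection).
    return path.read_text(encoding="utf-8", errors="replace")


class _TextCollector(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_html(path: Path) -> str:
    parser = _TextCollector()
    parser.feed(path.read_text(encoding="utf-8", errors="replace"))
    parser.close()
    return "\n".join(parser.parts)


def extract_docx(path: Path) -> str:
    from docx import Document  # provided by python-docx

    # Body paragraphs only; text inside tables is not included here.
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def extract_xlsx(path: Path) -> str:
    from openpyxl import load_workbook

    workbook = load_workbook(str(path), read_only=True, data_only=True)
    lines = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            cells = [str(c) for c in row if c is not None]
            if cells:
                lines.append("\t".join(cells))
    workbook.close()
    return "\n".join(lines)


# Format detection: lowercase file extension -> extractor function.
EXTRACTORS = {
    ".txt": extract_txt,
    ".html": extract_html,
    ".htm": extract_html,
    ".docx": extract_docx,
    ".xlsx": extract_xlsx,
}


def extract_text(path) -> str:
    """Single entry point: pick an extractor by extension and run it."""
    path = Path(path)
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise UnsupportedFormatError(
            f"Unsupported file type: {path.suffix or '(none)'}"
        )
    return extractor(path)
```

With this shape, categorization would only ever see the string returned by `extract_text`, which is one way to keep it consistent across formats. Supporting another format then means adding one function and one dictionary entry.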

Metadata

Assignees

No one assigned

    Labels

    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
