A pure Python library for extracting text, metadata, and structured elements from Microsoft Office files—both modern (.docx, .xlsx, .pptx) and legacy (.doc, .xls, .ppt) formats—plus PDF, email formats, and plain text.
The library also includes an optional SharePoint client for reading files directly from Microsoft SharePoint sites via the Graph API. You still orchestrate the pipeline: pull files (via sharepoint_io or your own Graph client), then pass the bytes into the extractors.
Install: uv add sharepoint-to-text
Python import: import sharepoint2text
CLI (text): sharepoint2text /path/to/file.docx > extraction.txt
CLI (JSON, full extraction): sharepoint2text --json /path/to/file.docx > extraction.json (no binary by default; add --binary to include)
CLI (JSON, units): sharepoint2text --json-unit /path/to/file.docx > units.json (no binary by default; add --binary to include)
- Unified API: `sharepoint2text.read_file(path)` yields one or more typed extraction results.
- Typed results: each format returns a specific dataclass (e.g. `DocxContent`, `PdfContent`) that also supports the common interface.
- Text: `get_full_text()` or `iterate_units()` (pages / slides / sheets depending on format; call `unit.get_text()` for the string).
- Structured content: tables and images where the format supports it.
- Metadata: file metadata (plus format-specific metadata where available).
- Serialization: `result.to_json()` returns a JSON-serializable dict.
Every extracted result implements the same high-level interface (ExtractionInterface). Use it to build pipelines that work across file types without special-casing .pdf vs .docx vs .pptx.
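For example, an indexing step can be written once against the interface and reused for every format (a minimal sketch; `index_documents` and its record shape are illustrative, not part of the library):

```python
import sharepoint2text

def index_documents(path: str) -> list[dict]:
    """Index any supported file without format-specific branches."""
    records = []
    for result in sharepoint2text.read_file(path):  # same calls for .pdf, .docx, .pptx, ...
        meta = result.get_metadata()
        records.append({"filename": meta.filename, "text": result.get_full_text()})
    return records
```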
| Goal | Recommended | Why |
|---|---|---|
| Get “the document text” as one string | `result.get_full_text()` | Best default for indexing and simple exports; hides format-specific unit details. |
| Chunk text by page/slide/sheet (RAG, citations, per-unit metadata) | `result.iterate_units()` | Stable unit boundaries for formats that have them (PDF pages, PPT slides, XLS(X) sheets). |
| Extract images (and optionally store payloads) | `result.iterate_images()` | Returns image objects with metadata; binary payload handling is caller-controlled. |
| Extract tables as structured data | `result.iterate_tables()` | Returns table objects as 2D arrays, suitable for CSV/JSON downstream. |
| Attach filename/path context | `result.get_metadata()` | Normalizes file metadata regardless of format; useful for provenance and linking. |
| Persist/transport results | `result.to_json()` / `ExtractionInterface.from_json(...)` | JSON-serializable representation; optional base64 encoding for binary fields. |
`get_full_text()`:
- Use when you want a single string per extracted item (search indexing, previews, “export to .txt”).
- It is usually derived from `iterate_units()`, but some formats may prepend metadata (e.g., titles) or omit optional content by default.

`iterate_units()`:
- Use when you need chunk boundaries aligned with the source structure (pages/slides/sheets) or when you want to keep unit-level metadata.
- Each unit supports `unit.get_text()`, `unit.get_images()`, `unit.get_tables()`, and `unit.get_metadata()`.

`iterate_images()` / `iterate_tables()`:
- Use when you want all images/tables across the document (often simpler than traversing units).
- Prefer unit-level access (`unit.get_images()`, `unit.get_tables()`) when you need “where did this come from?” context (page/slide number).

`get_metadata()`:
- Use for provenance fields like `filename`, `file_extension`, `file_path`, `folder_path`.
- Pair with unit metadata for precise citations (e.g., `file_path` + `page_number`); see the sketch after this list.

`to_json()` / `from_json()`:
- Use to store results, send them across processes, or debug extraction output.
- Binary payloads are representable but can be large; omit them unless you explicitly need embedded data.
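Putting `get_metadata()` and `iterate_units()` together, a per-unit citation record might look like this (a sketch; the page number comes from `enumerate`, since unit metadata fields vary by format):

```python
import sharepoint2text

result = next(sharepoint2text.read_file("report.pdf"))
meta = result.get_metadata()

citations = []
for page_number, unit in enumerate(result.iterate_units(), start=1):
    citations.append({
        "file_path": meta.file_path,  # normalized provenance field
        "page": page_number,          # unit index as a stand-in for the page number
        "text": unit.get_text(),
    })
```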
Plain text (single string):
import sharepoint2text
result = next(sharepoint2text.read_file("document.pdf"))
text = result.get_full_text()

Unit-based chunking (recommended for RAG):
import sharepoint2text
result = next(sharepoint2text.read_file("deck.pptx"))
for unit in result.iterate_units():
    chunk = unit.get_text()
    unit_meta = unit.get_metadata()  # e.g., slide/page/sheet number when available

Unlike popular alternatives that shell out to LibreOffice or Apache Tika (requiring Java), sharepoint-to-text is a native Python implementation with no system-level dependencies:
| Approach | Requirements | Cross-platform | Container-friendly |
|---|---|---|---|
| sharepoint-to-text | `uv add` only | Yes | Yes (minimal image) |
| LibreOffice-based | LibreOffice install, X11/headless setup | Complex | Large images (~1GB+) |
| Apache Tika | Java runtime, Tika server | Complex | Heavy (~500MB+) |
| subprocess-based | Shell access, security concerns | No | Risky |
This library parses Office binary formats (OLE2) and XML-based formats (OOXML) directly in Python, making it ideal for:
- RAG pipelines and LLM document ingestion
- Serverless functions (AWS Lambda, Google Cloud Functions)
- Containerized deployments with minimal footprint
- Secure environments where shell execution is restricted
- Cross-platform applications (Windows, macOS, Linux)
Enterprise SharePoints contain decades of accumulated documents. While modern .docx, .xlsx, and .pptx files are well-supported, legacy .doc, .xls, and .ppt files remain common. This library provides a unified interface for all formats—no conditional logic needed.
For scenarios where documents live in Microsoft SharePoint, the library includes a built-in Graph API client. This is an optional convenience layer, not required for local files or other storage backends. You are responsible for orchestrating the pull (list/download) and then calling the extractors:
from sharepoint2text.sharepoint_io import SharePointRestClient, EntraIDAppCredentials
credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)
client = SharePointRestClient(site_url="https://contoso.sharepoint.com/sites/Docs", credentials=credentials)

# List and download files
for file in client.list_all_files():
    content = client.download_file(file.id)
    # Pass to sharepoint2text extractors...

The client supports filtering by modification date (for delta-sync patterns), folder paths, and file extensions. See sharepoint2text/sharepoint_io/SETUP.md for Azure/Entra ID configuration instructions.
| Format | Extension | Description |
|---|---|---|
| Word 97-2003 | .doc | Word 97-2003 documents |
| Excel 97-2003 | .xls | Excel 97-2003 spreadsheets |
| PowerPoint 97-2003 | .ppt | PowerPoint 97-2003 presentations |
| Rich Text Format | .rtf | Rich Text Format documents |
| Format | Extension | Description |
|---|---|---|
| Word 2007+ | .docx | Word 2007+ documents |
| Word 2007+ (macro) | .docm | Word 2007+ macro-enabled documents |
| Excel 2007+ | .xlsx | Excel 2007+ spreadsheets |
| Excel 2007+ (macro) | .xlsm | Excel 2007+ macro-enabled spreadsheets |
| PowerPoint 2007+ | .pptx | PowerPoint 2007+ presentations |
| PowerPoint 2007+ (macro) | .pptm | PowerPoint 2007+ macro-enabled presentations |
| Format | Extension | Description |
|---|---|---|
| Text | .odt | OpenDocument Text |
| Presentation | .odp | OpenDocument Presentation |
| Spreadsheet | .ods | OpenDocument Spreadsheet |
| Format | Extension | Description |
|---|---|---|
| EML | .eml | RFC 822 email format |
| MSG | .msg | Microsoft Outlook email format |
| MBOX | .mbox | Unix mailbox format (multiple emails) |
| Format | Extension | Description |
|---|---|---|
| Plain Text | .txt | Plain text files |
| Markdown | .md | Markdown |
| CSV | .csv | Comma-separated values |
| TSV | .tsv | Tab-separated values |
| JSON | .json | JSON files |
| Format | Extension | Description |
|---|---|---|
| PDF | .pdf | PDF documents |
| Format | Extension | Description |
|---|---|---|
| HTML | .html, .htm | HTML documents |
| MHTML | .mhtml, .mht | MIME HTML (web archive) files |
| EPUB | .epub | EPUB e-book format |
| Format | Extension | Description |
|---|---|---|
| ZIP | .zip | ZIP archives |
| TAR | .tar | TAR archives |
| Gzip TAR | .tar.gz, .tgz, .gz | Gzip-compressed TAR archives |
| Bzip2 TAR | .tar.bz2, .tbz2, .bz2 | Bzip2-compressed TAR archives |
| XZ TAR | .tar.xz, .txz, .xz | XZ-compressed TAR archives |
Archive extraction recursively processes all supported files within the archive.
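For example (a sketch; it assumes each supported file inside the archive surfaces as its own item from `read_file`):

```python
import sharepoint2text

# Archives can yield multiple results: one per supported file they contain.
for result in sharepoint2text.read_file("bundle.zip"):
    meta = result.get_metadata()
    print(meta.filename, len(result.get_full_text()))
```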
uv add sharepoint-to-text

Optional: faster AES handling for encrypted PDFs (avoids the slow fallback crypto and large-PDF image skips):

uv add "sharepoint-to-text[pdf-crypto]"

Or install from source:

git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups

These are required for normal use of the library:
- `charset-normalizer`: Automatic encoding detection for plain text files
- `defusedxml`: Hardened XML parsing for OOXML/ODF formats
- `mail-parser`: RFC 822 email parsing (.eml)
- `msg-parser`: Outlook .msg extraction
- `olefile`: OLE2 container parsing for legacy Office formats
- `openpyxl`: .xlsx parsing
- `pypdf`: .pdf parsing
- `xlrd`: .xls parsing
These are only needed for development workflows:
- `pytest`: test runner
- `pre-commit`: linting/format hooks
- `black`: code formatter
These are opt-in extras for specific use cases:
- `pycryptodome`: Faster AES crypto for encrypted PDFs (`pdf-crypto` extra)
sharepoint2text.read_file(...) returns a generator of extraction results implementing a common interface. Most formats yield a single item, but some (notably .mbox) can yield multiple items.
import sharepoint2text
# Works identically for ANY supported format
# Most formats yield a single item, so use next() for convenience
for result in sharepoint2text.read_file("document.docx"): # or .doc, .pdf, .pptx, etc.
    # Methods available on ALL content types:
    text = result.get_full_text()  # Complete text as a single string
    metadata = result.get_metadata()  # File metadata (filename/path; plus format-specific fields when available)

    # Iterate over logical units (varies by format - see below)
    for unit in result.iterate_units():
        print(unit.get_text())

    # Iterate over extracted images
    for image in result.iterate_images():
        print(image)

    # Iterate over extracted tables
    for table in result.iterate_tables():
        print(table)

# For single-item formats, you can use next() directly:
result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())

Notes: ImageInterface provides get_bytes(), get_content_type(), get_caption(), get_description(), and get_metadata() (unit index, image index, content type, width, height). TableInterface provides get_table() (rows as lists) and get_dim() (rows, columns).
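A short sketch using those accessors (method names as listed above; the printed layout is illustrative):

```python
import sharepoint2text

result = next(sharepoint2text.read_file("report.pdf"))

for image in result.iterate_images():
    info = image.get_metadata()   # unit index, image index, content type, width, height
    payload = image.get_bytes()   # binary payload (may be empty if extraction was skipped)
    print(image.get_content_type(), info)

for table in result.iterate_tables():
    rows, cols = table.get_dim()  # (rows, columns)
    grid = table.get_table()      # rows as lists of cell values
    print(f"{rows}x{cols} table, first row: {grid[0] if grid else []}")
```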
Most results also expose format-specific structured fields (e.g. PdfContent.pages, PptxContent.slides, XlsxContent.sheets) in addition to the common interface—see Return Types below.
All extraction results support to_json() for a JSON-serializable representation of the extracted data (including nested dataclasses).
import json
import sharepoint2text
result = next(sharepoint2text.read_file("document.docx"))
print(json.dumps(result.to_json()))

To restore objects from JSON, use ExtractionInterface.from_json(...).
from sharepoint2text.parsing.extractors.data_types import ExtractionInterface
restored = ExtractionInterface.from_json(result.to_json())

Different file formats have different natural structural units:
| Format | `iterate_units()` yields | Notes |
|---|---|---|
| .docx, .doc, .odt | 1 item (full text) | Word/text documents have no page structure in the file format |
| .xlsx, .xls, .ods | 1 item per sheet | Each yield contains sheet content |
| .pptx, .ppt, .odp | 1 item per slide | Each yield contains slide text |
| .pdf | 1 item per page | Each yield contains page text |
| .eml, .msg | 1 item (email body) | Plain text or HTML body |
| .mbox | 1 item per email | Mailboxes can contain multiple emails |
| .txt, .csv, .json, .tsv | 1 item (full content) | Single unit |
Note on Word documents: The .doc and .docx file formats do not store page boundaries—pages are a rendering artifact determined by fonts, margins, and printer settings. The library returns the full document as a single text unit.
Note on generators: All extractors return generators. Most formats yield a single content object, but .mbox files can yield multiple EmailContent objects (one per email in the mailbox). Use next() for single-item formats or iterate with for to handle all cases.
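If you want single-item handling to fail loudly instead of silently dropping extra items, a small guard helper makes the assumption explicit (hypothetical helper, not part of the library):

```python
import sharepoint2text

def read_single(path: str):
    """Return exactly one extraction result or raise on multi-item inputs."""
    results = list(sharepoint2text.read_file(path))
    if len(results) != 1:
        raise ValueError(f"expected 1 extraction result, got {len(results)} from {path}")
    return results[0]
```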
The interface provides two methods for accessing text content, and you must decide which is appropriate for your use case:
| Method | Returns | Best for |
|---|---|---|
| `get_full_text()` | All text as a single string | Simple extraction, full-text search, when structure doesn't matter |
| `iterate_units()` | Yields logical units (pages, slides, sheets) | RAG pipelines, per-unit indexing, preserving document structure |
For RAG and vector storage: Consider whether storing pages/slides/sheets as separate chunks with metadata (e.g., page numbers) benefits your retrieval strategy. This allows more precise source attribution when users query your system.
# Option 1: Store entire document as one chunk
result = next(sharepoint2text.read_file("report.pdf"))
store_in_vectordb(text=result.get_full_text(), metadata={"source": "report.pdf"})
# Option 2: Store each page separately with page numbers
result = next(sharepoint2text.read_file("report.pdf"))
for page_num, unit in enumerate(result.iterate_units(), start=1):
    store_in_vectordb(
        text=unit.get_text(),
        metadata={"source": "report.pdf", "page": page_num}
    )

Trade-offs to consider:
- Per-unit storage enables citing specific pages/slides in responses, but creates more chunks
- Full-text storage is simpler and may work better for small documents
- Word documents (`.doc`, `.docx`) only yield one unit from `iterate_units()` since they lack page structure; for these formats, both methods are equivalent
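If individual pages or slides are still too large for your embedding model, you can sub-chunk each unit's text; a minimal sketch with fixed-size overlapping character windows (the sizes are arbitrary):

```python
import sharepoint2text

def sub_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split unit text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

result = next(sharepoint2text.read_file("report.pdf"))
for page_num, unit in enumerate(result.iterate_units(), start=1):
    for piece in sub_chunks(unit.get_text()):
        ...  # store piece together with {"page": page_num} metadata
```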
get_full_text() is intended as a convenient “best default” for each format. In a few formats it intentionally differs from a plain "\n".join(unit.get_text() for unit in iterate_units()), or it omits optional content unless you opt in:
| Format | `get_full_text()` default behavior | Not included by default / where to find it |
|---|---|---|
| .doc | Prepends `metadata.title` (if present) and returns main document body | `footnotes`, `headers_footers`, `annotations` are separate fields (`DocContent`) |
| .docx | Returns `full_text` (including formulas) | Comments are available on `DocxContent.comments` (not included in `get_full_text()`) |
| .ppt | Per-slide title + body + other concatenated | Speaker notes live in `slide.notes` (`PptSlideContent`) |
| .pptx | Per-slide `base_text` plus formulas concatenated | Pass `include_image_captions` to `PptxContent.get_full_text(...)` (comments are available on `PptxSlide.comments`) |
| .odp | Per-slide `text_combined` concatenated | Pass `include_notes`/`include_annotations` to `OdpContent.get_full_text(...)` |
| .xls | Concatenation of sheet text blocks (no sheet names) | Sheet names are available as `sheet.name` (`XlsSheet`) |
| .xlsx, .ods | Includes sheet name + sheet text for each sheet | Images are available via `iterate_images()` / sheet image lists |
| .pdf | Concatenation of extracted page text | Tables/images are available via `iterate_tables()` / `iterate_images()` (`PdfContent.pages`) |
| .eml, .msg, .mbox | Returns `body_plain` when present, else `body_html` | Attachments are in `EmailContent.attachments` and can be extracted via `iterate_supported_attachments()` |
| .txt, .csv, .tsv, .json, .md, .html | Returns stripped content (leading/trailing whitespace removed) | Use the raw fields (`.content`) if you need untrimmed text |
| .rtf | Returns the extractor's `full_text` when available | `iterate_units()` yields per-page text when explicit `\page` breaks exist |
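For example, ODP speaker notes can be opted into the full text via the keyword arguments named above (a minimal sketch; the file name is illustrative and the flags are assumed to be booleans):

```python
import sharepoint2text

result = next(sharepoint2text.read_file("talk.odp"))
# Default full text omits notes/annotations; opt in explicitly.
text_with_notes = result.get_full_text(include_notes=True, include_annotations=True)
```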
import sharepoint2text
# Extract from any file - format auto-detected (use next() for single-item formats)
result = next(sharepoint2text.read_file("quarterly_report.docx"))
print(result.get_full_text())
# Check format support before processing
if sharepoint2text.is_supported_file("document.xyz"):
    for result in sharepoint2text.read_file("document.xyz"):
        print(result.get_full_text())
# Access metadata
result = next(sharepoint2text.read_file("presentation.pptx"))
meta = result.get_metadata()
print(f"Author: {meta.author}, Modified: {meta.modified}")
print(meta.to_dict()) # Convert to dictionary
# Process emails (mbox can contain multiple emails)
for email in sharepoint2text.read_file("mailbox.mbox"):
print(f"From: {email.from_email.address}")
print(f"Subject: {email.subject}")
print(email.get_full_text())import sharepoint2text
# Excel: iterate over sheets
result = next(sharepoint2text.read_file("budget.xlsx"))
for sheet in result.sheets:
    print(f"Sheet: {sheet.name}")
    print(f"Rows: {len(sheet.data)}")  # List of row dictionaries
    print(sheet.text)  # Text representation

# PowerPoint: iterate over slides
result = next(sharepoint2text.read_file("deck.pptx"))
for slide in result.slides:
    print(f"Slide {slide.slide_number}: {slide.title}")
    print(slide.content_placeholders)  # Body text
    print(slide.images)  # Image metadata

# PDF: iterate over pages
result = next(sharepoint2text.read_file("report.pdf"))
for page_num, page in enumerate(result.pages, start=1):
    print(f"Page {page_num}: {page.text[:100]}...")
    print(f"Images: {len(page.images)}")

# Email: access email-specific fields
email = next(sharepoint2text.read_file("message.eml"))
print(f"From: {email.from_email.name} <{email.from_email.address}>")
print(f"To: {', '.join(e.address for e in email.to_emails)}")
print(f"Subject: {email.subject}")
print(f"Body: {email.body_plain or email.body_html}")

For API responses or in-memory data:
import sharepoint2text
import io
# Direct extractor usage with BytesIO (returns generator, use next() for single items)
with open("document.docx", "rb") as f:
result = next(sharepoint2text.read_docx(io.BytesIO(f.read()), path="document.docx"))
# Get extractor dynamically based on filename
def extract_from_api(filename: str, content: bytes):
extractor = sharepoint2text.get_extractor(filename)
# Returns a generator - iterate or use next()
return list(extractor(io.BytesIO(content), path=filename))
results = extract_from_api("report.pdf", pdf_bytes)
for result in results:
print(result.get_full_text())- No OCR support: This library does not perform optical character recognition. PDFs that consist of scanned images or photos of documents will return empty text. The images themselves are still extracted and available via
iterate_images(), but no text is derived from them. - Table detection is best-effort: PDF table extraction relies on parseable text content and heuristics to identify table structures. Complex layouts, merged cells, or tables spanning multiple pages may not be detected accurately. Results should be validated for critical use cases.
- Image extraction on large encrypted files: When a PDF is AES-encrypted and pypdf is running in its fallback crypto provider (i.e., neither
cryptographynorpycryptodomeis installed), image extraction is skipped for large files (>= 10MB). Text and tables still extract, but image lists are empty. Installcryptographyorpycryptodometo enable full PDF image extraction without this skip. - Password-protected PDFs: PDFs requiring a non-empty password are rejected with an
ExtractionFileEncryptedError.
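A defensive wrapper for password-protected files might look like this (a sketch; the exception's import location is an assumption, check the package for the actual path):

```python
import sharepoint2text
from sharepoint2text import ExtractionFileEncryptedError  # assumed export location

def safe_full_text(path: str) -> str | None:
    try:
        return next(sharepoint2text.read_file(path)).get_full_text()
    except ExtractionFileEncryptedError:
        return None  # password-protected; skip or route to manual handling
```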
After installation, a sharepoint2text command is available. It accepts a single file path and prints the extracted full text to stdout by default.
sharepoint2text /path/to/file.pdf > extraction.txt

| Option | Output | Notes |
|---|---|---|
| (default) | Plain text | Prints `result.get_full_text()` (blank-line separated if multiple items). |
| `--json` | JSON extraction object(s) | Prints `result.to_json()`; emits a single JSON object (one item) or a JSON array (multiple items). Binary fields are null by default; add `--binary` to include base64 blobs. |
| `--json-unit` | JSON unit list(s) | Prints a JSON list of unit representations using `result.iterate_units()` (e.g., pages/slides/sheets). For multi-item inputs (e.g. .mbox), emits a JSON list where each item is that extraction's unit list. Binary fields are null by default; add `--binary` to include base64 blobs. |
| `--binary` | Include binary payloads | Only valid with `--json` or `--json-unit`. Encodes bytes/BytesIO as base64 in wrapper objects. |
--json and --json-unit are mutually exclusive.
To emit structured output for the full extraction object, use --json:
sharepoint2text --json /path/to/file.pdf > extraction.json

To emit per-unit output (pages/slides/sheets depending on format), use --json-unit:

sharepoint2text --json-unit /path/to/file.pdf > units.json

Some formats include binary payloads (e.g., embedded images in Office/PDF files, email attachments). The CLI omits binary payloads in JSON by default (emits null for binary fields). Use --binary to include base64 blobs:

sharepoint2text --json /path/to/file.pdf > extraction.json
# include binary payloads
sharepoint2text --json --binary /path/to/file.pdf > extraction.with-binary.json
# include binary payloads (units mode)
sharepoint2text --json-unit --binary /path/to/file.pdf > units.with-binary.json

Note: the PDF image skip described in the limitations section also applies to CLI output. In that scenario, --json/--json-unit will report empty image lists even with --binary, because the images are not extracted.
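Since --json emits result.to_json(), the output file can be rehydrated in Python (a sketch; assumes a single-item input, so the file holds one JSON object rather than an array):

```python
import json
from sharepoint2text.parsing.extractors.data_types import ExtractionInterface

with open("extraction.json") as f:
    data = json.load(f)

restored = ExtractionInterface.from_json(data)
print(restored.get_full_text()[:200])
```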
Core functions:

import sharepoint2text
# Read any supported file (recommended entry point)
# Returns a generator - use next() for single-item formats or iterate for all
# Read any supported file (recommended entry point).
# Returns a generator - use next() for single-item formats or iterate for all.
sharepoint2text.read_file(path: str | Path) -> Generator[ContentType, Any, None]

# Check if a file extension is supported
sharepoint2text.is_supported_file(path: str) -> bool

# Get extractor function for a file type
sharepoint2text.get_extractor(path: str) -> Callable[[io.BytesIO, str | None], Generator[ContentType, Any, None]]

All accept io.BytesIO and optional path for metadata population. All return generators:
sharepoint2text.read_docx(file: io.BytesIO, path: str | None = None) -> Generator[DocxContent, Any, None]
sharepoint2text.read_doc(file: io.BytesIO, path: str | None = None) -> Generator[DocContent, Any, None]
sharepoint2text.read_xlsx(file: io.BytesIO, path: str | None = None) -> Generator[XlsxContent, Any, None]
sharepoint2text.read_xls(file: io.BytesIO, path: str | None = None) -> Generator[XlsContent, Any, None]
sharepoint2text.read_pptx(file: io.BytesIO, path: str | None = None) -> Generator[PptxContent, Any, None]
sharepoint2text.read_ppt(file: io.BytesIO, path: str | None = None) -> Generator[PptContent, Any, None]
sharepoint2text.read_odt(file: io.BytesIO, path: str | None = None) -> Generator[OdtContent, Any, None]
sharepoint2text.read_odp(file: io.BytesIO, path: str | None = None) -> Generator[OdpContent, Any, None]
sharepoint2text.read_ods(file: io.BytesIO, path: str | None = None) -> Generator[OdsContent, Any, None]
sharepoint2text.read_pdf(file: io.BytesIO, path: str | None = None) -> Generator[PdfContent, Any, None]
sharepoint2text.read_plain_text(file: io.BytesIO, path: str | None = None) -> Generator[PlainTextContent, Any, None]
sharepoint2text.read_email__eml_format(file: io.BytesIO, path: str | None = None) -> Generator[EmailContent, Any, None]
sharepoint2text.read_email__msg_format(file: io.BytesIO, path: str | None = None) -> Generator[EmailContent, Any, None]
sharepoint2text.read_email__mbox_format(file: io.BytesIO, path: str | None = None) -> Generator[EmailContent, Any, None]

All content types implement the common interface:
class ExtractionInterface(Protocol):
    def iterate_units(self) -> Iterator[UnitInterface]: ...  # Iterate over logical units
    def iterate_images(self) -> Generator[ImageInterface, None, None]: ...
    def iterate_tables(self) -> Generator[TableInterface, None, None]: ...
    def get_full_text(self) -> str: ...  # Complete text as string
    def get_metadata(self) -> FileMetadataInterface: ...  # Metadata with to_dict()
    def to_json(self) -> dict: ...  # JSON-serializable representation

    @classmethod
    def from_json(cls, data: dict) -> "ExtractionInterface": ...

DocxContent (.docx):

result.metadata # DocxMetadata (title, author, created, modified, ...)
result.paragraphs # List[DocxParagraph] (text, style, runs with formatting)
result.tables # List[List[List[str]]] (cell data)
result.images # List[DocxImage] (filename, content_type, data, size_bytes)
result.headers # List[DocxHeaderFooter]
result.footers # List[DocxHeaderFooter]
result.hyperlinks # List[DocxHyperlink] (text, url)
result.footnotes # List[DocxNote] (id, text)
result.endnotes # List[DocxNote]
result.comments # List[DocxComment] (author, date, text)
result.sections # List[DocxSection] (page dimensions, margins)
result.full_text # str (pre-computed full text)

DocContent (.doc):

result.metadata # DocMetadata (title, author, num_pages, num_words, num_chars, ...)
result.main_text # str (main document body)
result.footnotes # str (concatenated footnotes)
result.headers_footers # str (concatenated headers/footers)
result.annotations # str (concatenated annotations)

XlsxContent / XlsContent (.xlsx, .xls):

result.metadata # XlsxMetadata / XlsMetadata (title, creator, created, modified, ...)
result.sheets # List[XlsxSheet / XlsSheet]
# Each sheet:
sheet.name # str (sheet name)
sheet.data # List[Dict[str, Any]] (rows as dictionaries)
sheet.text # str (text representation)

PptxContent (.pptx):

result.metadata # PptxMetadata (title, author, created, modified, ...)
result.slides # List[PPTXSlide]
# Each slide:
slide.slide_number # int (1-indexed)
slide.title # str
slide.footer # str
slide.content_placeholders # List[str] (body content)
slide.other_textboxes # List[str] (free-form text)
slide.images # List[PPTXImage] (filename, content_type, size_bytes, blob)
slide.text # str (pre-computed combined text)

PptContent (.ppt):

result.metadata # PptMetadata (title, author, num_slides, created, modified, ...)
result.slides # List[PptSlideContent]
result.all_text # List[str] (flat list of all text)
# Each slide:
slide.slide_number # int (1-indexed)
slide.title # str | None
slide.body_text # List[str]
slide.other_text # List[str]
slide.notes # List[str] (speaker notes)
slide.text_combined # str (property: title + body + other)
slide.all_text # List[PptTextBlock] (with text_type info)

OdpContent (.odp):

result.metadata # OdpMetadata (title, creator, creation_date, generator, ...)
result.slides # List[OdpSlide]
# Each slide:
slide.slide_number # int (1-indexed)
slide.name # str (slide name)
slide.title # str
slide.body_text # List[str]
slide.other_text # List[str]
slide.tables # List[List[List[str]]] (tables on slide)
slide.annotations # List[OdpAnnotation] (comments)
slide.images # List[OdpImage] (embedded images with href, name, data, size_bytes)
slide.notes # List[str] (speaker notes)
slide.text_combined # str (property: title + body + other)

OdsContent (.ods):

result.metadata # OdsMetadata (title, creator, creation_date, generator, ...)
result.sheets # List[OdsSheet]
# Each sheet:
sheet.name # str (sheet name)
sheet.data # List[Dict[str, Any]] (row data with column keys A, B, C, ...)
sheet.text # str (tab-separated cell values, newline-separated rows)
sheet.annotations # List[OdsAnnotation] (cell comments)
sheet.images # List[OdsImage] (embedded images)

PdfContent (.pdf):

result.metadata # PdfMetadata (total_pages)
result.pages # List[PdfPage]
# Each page:
page.text # str
page.images # List[PdfImage] (index, name, width, height, data, format)
page.tables # List[List[List[str]]]

Plain text content (.txt, .csv, .tsv, .json, ...):

result.content # str (full file content)
result.metadata # FileMetadataInterface (filename, file_extension, file_path, folder_path)

EmailContent (.eml, .msg, .mbox):

result.from_email # EmailAddress (name, address)
result.to_emails # List[EmailAddress]
result.to_cc # List[EmailAddress]
result.to_bcc # List[EmailAddress]
result.reply_to # List[EmailAddress]
result.subject # str
result.in_reply_to # str (message ID of parent email)
result.body_plain # str (plain text body)
result.body_html # str (HTML body)
result.metadata # EmailMetadata (date, message_id, plus file metadata)
# EmailAddress structure:
email.name # str (display name)
email.address # str (email address)

HTML content (.html, .htm):

result.content # str (plain text content)
result.tables # List[List[List[str]]] (table cell values)
result.headings # List[Dict[str, str]] (level/text)
result.links # List[Dict[str, str]] (text/href)
result.metadata # HtmlMetadata (title, language, charset, ...)

Batch-process all supported files in a folder:

import sharepoint2text
from pathlib import Path
def extract_all_documents(folder: Path) -> dict[str, list[str]]:
    """Extract text from all supported files in a folder."""
    results = {}
    for file_path in folder.rglob("*"):
        if sharepoint2text.is_supported_file(str(file_path)):
            try:
                # Collect all content from the generator (handles mbox with multiple emails)
                texts = [result.get_full_text() for result in sharepoint2text.read_file(file_path)]
                results[str(file_path)] = texts
            except Exception as e:
                print(f"Failed to extract {file_path}: {e}")
    return results

Extract embedded images:

import sharepoint2text
# From PDF
result = next(sharepoint2text.read_file("document.pdf"))
for page_num, page in enumerate(result.pages, start=1):
    for img in page.images:
        with open(f"page{page_num}_{img.name}.{img.format}", "wb") as out:
            out.write(img.data)

# From PowerPoint
result = next(sharepoint2text.read_file("slides.pptx"))
for slide in result.slides:
    for img in slide.images:
        with open(img.filename, "wb") as out:
            out.write(img.blob)

# From Word
result = next(sharepoint2text.read_file("document.docx"))
for img in result.images:
    if img.data:
        with open(img.filename, "wb") as out:
            out.write(img.data.getvalue())

Process email files:

import sharepoint2text
# Process a single email file (.eml or .msg)
email = next(sharepoint2text.read_file("message.eml"))
print(f"From: {email.from_email.name} <{email.from_email.address}>")
print(f"Subject: {email.subject}")
print(f"Date: {email.metadata.date}")
print(f"Body:\n{email.body_plain}")
# Process a mailbox with multiple emails (.mbox)
for i, email in enumerate(sharepoint2text.read_file("archive.mbox")):
print(f"\n--- Email {i + 1} ---")
print(f"From: {email.from_email.address}")
print(f"To: {', '.join(e.address for e in email.to_emails)}")
print(f"Subject: {email.subject}")
if email.to_cc:
print(f"CC: {', '.join(e.address for e in email.to_cc)}")import sharepoint2text
def prepare_for_rag(file_path: str) -> list[dict]:
    """Prepare document chunks for RAG ingestion."""
    chunks = []
    # Handle all content items from the generator
    for result in sharepoint2text.read_file(file_path):
        meta = result.get_metadata()
        for i, unit in enumerate(result.iterate_units()):
            if unit.get_text().strip():  # Skip empty units
                chunks.append({
                    "text": unit.get_text(),
                    "metadata": {
                        "source": file_path,
                        "chunk_index": i,
                        "author": getattr(meta, "author", None),
                        "title": getattr(meta, "title", None),
                    }
                })
    return chunks

This integration is optional; you can use sharepoint2text with any storage backend. When using sharepoint_io, you still orchestrate download and extraction (as shown below).
import io
from datetime import datetime, timedelta, timezone
import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    FileFilter,
    SharePointRestClient,
)

# Configure SharePoint access
credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)
client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

# Delta sync: process files modified in the last 7 days
one_week_ago = datetime.now(timezone.utc) - timedelta(days=7)
file_filter = FileFilter(
    modified_after=one_week_ago,
    extensions=[".docx", ".pdf", ".pptx"],
)

for file_meta in client.list_files_filtered(file_filter):
    # Download and extract
    content = client.download_file(file_meta.id)
    extractor = sharepoint2text.get_extractor(file_meta.name)
    for result in extractor(io.BytesIO(content), path=file_meta.name):
        print(f"File: {file_meta.get_full_path()}")
        print(f"Text: {result.get_full_text()[:200]}...")

The library raises dedicated exceptions:

- ExtractionFileFormatNotSupportedError: raised when no extractor exists for a given file type (e.g., unsupported extension/MIME mapping in the router).
- ExtractionFileEncryptedError: raised when an extractor detects encryption or password protection (e.g., encrypted PDF, OOXML/ODF password-protected files, legacy Office with FILEPASS/encryption flags).
- LegacyMicrosoftParsingError: raised when legacy Office parsing fails for non-encryption reasons (corrupt OLE streams, invalid headers, or unsupported legacy variations).
Apache 2.0 - see LICENSE for details.
This project is not affiliated with, endorsed by, or sponsored by Microsoft.