A permissively licensed toolkit for extracting content from documents.
docprims extracts text from documents with permissive licensing throughout. Use it to build content analysis tools, search indexers, or document processors that can be statically linked and embedded in commercial software.
| Format | Crate | Status |
|---|---|---|
| DOCX | docprims-ooxml | Planned |
| XLSX | docprims-ooxml | Planned |
| PPTX | docprims-ooxml | Planned |
| PDF | docprims-pdf | Future |
| Markdown | docprims-text | Planned |
| HTML | docprims-text | Planned |
| XML | docprims-text | Planned |
You need to extract text from PDFs, Word documents, or spreadsheets. Your options:
- Use established libraries — Tools like poppler or mupdf work well, but their copyleft licenses restrict how you can distribute your software
- Roll your own — Format specs are complex; PDF alone has decades of edge cases
- Pay for commercial solutions — Often expensive and still come with licensing constraints
docprims offers a fourth path: MIT/Apache-2.0 licensed extraction that you can statically link, embed in commercial software, or ship without copyleft obligations.
AI/ML Engineers: Building pipelines that ingest documents for training, RAG, or content analysis. You need extraction without license overhead in your data stack.
Platform Teams: Your legal department has opinions about copyleft licenses in your software supply chain. docprims is designed for environments where license hygiene matters.
Document Processing SaaS: Building products that handle customer documents. Embedding a permissively-licensed extractor simplifies your licensing story.
Open Source Projects: Avoiding license compatibility debates. MIT/Apache-2.0 is unambiguous.
[dependencies]
docprims-ooxml = "0.1"
docprims-text = "0.1"

go get github.com/3leaps/docprims/bindings/go/docprims

By default, the Go bindings link a vendored static library.
To link the shared library instead (recommended if your application also links another Rust staticlib via cgo):
go build -tags docprims_shared ./...

Runtime search path notes for docprims_shared:
- v0.1.4+: Go bindings embed rpath entries for the vendored lib-shared/ directories so local builds can run without extra env vars
- For distributing binaries, prefer bundling the shared library next to your executable and embedding an rpath like @executable_path (macOS) / $ORIGIN (Linux)
- Linux: set LD_LIBRARY_PATH to include bindings/go/docprims/lib-shared/<platform>
- macOS: set DYLD_LIBRARY_PATH to include bindings/go/docprims/lib-shared/<platform>
- Windows: add bindings/go/docprims/lib-shared/windows-amd64 to PATH
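
The per-platform search path notes above can be summarized in a small helper. This is only a sketch: `sharedLibHint` is a hypothetical function, not part of the docprims bindings, and it assumes `<platform>` is named `<GOOS>-<GOARCH>`, which is inferred from the `windows-amd64` directory mentioned above and may not hold for every platform.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// sharedLibHint returns the environment variable to extend and the vendored
// lib-shared directory for the given OS/architecture pair.
// Assumption: the <platform> directory is named "<GOOS>-<GOARCH>".
func sharedLibHint(root, goos, goarch string) (envVar, dir string) {
	dir = filepath.Join(root, "bindings", "go", "docprims", "lib-shared", goos+"-"+goarch)
	switch goos {
	case "darwin":
		envVar = "DYLD_LIBRARY_PATH"
	case "windows":
		envVar = "PATH"
	default: // treated like Linux in this sketch
		envVar = "LD_LIBRARY_PATH"
	}
	return envVar, dir
}

func main() {
	// Print a hint for the machine this program runs on.
	envVar, dir := sharedLibHint(".", runtime.GOOS, runtime.GOARCH)
	fmt.Printf("add %s to %s\n", dir, envVar)
}
```

With v0.1.4+ the embedded rpath entries should make this unnecessary for local builds; the helper is mainly useful when debugging a loader error.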
Note: the Go bindings vendor a Rust staticlib (libdocprims_ffi.a). In some applications that also link another Rust
staticlib via cgo, the final link can fail with duplicate Rust runtime symbols (commonly _rust_eh_personality). In that case, use one of:
- isolate docprims behind a build tag in the consumer until the link model is resolved
- use the docprims CLI as a subprocess
- switch one Rust dependency to a shared-library distribution
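
The subprocess workaround listed above might look roughly like the following. It is a minimal sketch that uses only the Go standard library and the `docprims extract <file>` invocation shown in the CLI usage of this README. The helper names `extractArgs` and `extractViaCLI` are hypothetical, and the sketch assumes the CLI writes extracted text to stdout and exits non-zero on failure.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// extractArgs builds the argument list for `docprims extract <path>`.
func extractArgs(path string) []string {
	return []string{"extract", path}
}

// extractViaCLI runs the given docprims binary and returns its stdout.
// Assumption: extracted text goes to stdout; errors produce a non-zero exit.
func extractViaCLI(bin, path string) (string, error) {
	out, err := exec.Command(bin, extractArgs(path)...).Output()
	if err != nil {
		return "", fmt.Errorf("docprims extract %s: %w", path, err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: extract <document>")
		os.Exit(2)
	}
	// "docprims" is resolved through PATH (for example after cargo install docprims-cli).
	text, err := extractViaCLI("docprims", os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

Because no Rust staticlib is linked into the Go binary in this setup, the duplicate-symbol problem does not arise, at the cost of spawning one process per call.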
From a git checkout (v0.1.x). Works with both Node.js and Bun:
cd bindings/typescript/docprims
npm install
npm run build
npm run build:native

From a consumer project using file: protocol:
{
  "dependencies": {
    "@3leaps/docprims": "file:/path/to/docprims/bindings/typescript/docprims"
  }
}

cargo install docprims-cli

use docprims_ooxml::extract_docx;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = extract_docx("document.docx")?;
    println!("{}", text.content);
    Ok(())
}

# Extract text from a document
docprims extract document.docx
# Output as JSON with metadata
docprims extract document.docx --format json --include-metadata
# Extract from multiple formats
docprims extract report.pdf slides.pptx data.xlsx

docprims/
├── crates/
│ ├── docprims-core/ # Shared types, traits, errors
│ ├── docprims-text/ # Markdown, HTML, XML
│ ├── docprims-ooxml/ # DOCX, XLSX, PPTX
│ └── docprims-cli/ # CLI binary
├── ffi/
│ └── docprims-ffi/ # C-ABI for language bindings
└── bindings/
    ├── go/               # Go binding
    ├── typescript/       # TypeScript binding (Node.js + Bun)
    └── python/           # Python binding
Project documentation conventions (what is canonical vs planning notes):
docs/orientation/sources-of-truth.md
- Rust 1.81+
- curl (for bootstrap)
make bootstrap # Install development tools
make check # Run all quality checks
make test # Run tests

make fmt # Format code
make lint # Run clippy
make deny # Check licenses (GPL-free enforcement)
make audit # Security vulnerability scan

docprims is designed for environments where dependency hygiene matters:
- License-clean: All dependencies use MIT, Apache-2.0, or compatible licenses
- Auditable: Run cargo tree to inspect the full dependency graph
- SBOM-ready: Compatible with cargo sbom
cargo sbom - No runtime network calls: All functionality is local
# Check dependencies
cargo deny check licenses
# Audit for vulnerabilities
cargo audit

docprims builds on ideas from others in this space:
- poppler — Excellent PDF rendering library (GPL). If copyleft works for your use case, it's battle-tested.
- mupdf — High-quality document toolkit (AGPL). Commercial licenses available.
- calamine — MIT-licensed Rust XLSX/ODS reader. Good reference for spreadsheet parsing.
- pdf-rs — MIT-licensed Rust PDF parsing.
We're not claiming to replace these projects. docprims fills a specific niche: embeddable, license-clean document extraction with first-class bindings.
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
Subject to 3 Leaps OSS policies.
- sysprims — Process control primitives with the same licensing philosophy (sibling project)
See CONTRIBUTING.md for guidelines.