docprims

A permissively licensed toolkit for extracting content from documents.

Overview

docprims extracts text from documents, and the toolkit is permissively licensed throughout. Use it to build content analysis tools, search indexers, or document processors that can be statically linked and embedded in commercial software.

Supported Formats

Format	Crate	Status
DOCX	`docprims-ooxml`	Planned
XLSX	`docprims-ooxml`	Planned
PPTX	`docprims-ooxml`	Planned
PDF	`docprims-pdf`	Future
Markdown	`docprims-text`	Planned
HTML	`docprims-text`	Planned
XML	`docprims-text`	Planned
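
The table above amounts to a dispatch from file extension to the crate that handles the format. A minimal sketch of that lookup, using only the standard library, might look like the following. The function name `crate_for_path` is hypothetical and not part of any docprims API, and the extension spellings for the text formats (`md`, `html`, `xml`) are assumptions.

```rust
use std::path::Path;

/// Hypothetical helper (not a docprims API): maps a file extension to the
/// crate that the Supported Formats table lists for that format.
fn crate_for_path(path: &str) -> Option<&'static str> {
    // Lowercase the extension so "REPORT.DOCX" and "report.docx" both match.
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "docx" | "xlsx" | "pptx" => Some("docprims-ooxml"),
        "pdf" => Some("docprims-pdf"),
        // Assumed extension spellings for the text formats.
        "md" | "html" | "xml" => Some("docprims-text"),
        // Unknown extension: no crate in the table handles it.
        _ => None,
    }
}
```

Returning `Option` rather than panicking lets a caller skip unsupported files and carry on with the rest of a batch.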

The Problem

You need to extract text from PDFs, Word documents, or spreadsheets. Your options:

Use established libraries — Tools like poppler or mupdf work well, but their copyleft licenses restrict how you can distribute your software
Roll your own — Format specs are complex; PDF alone has decades of edge cases
Pay for commercial solutions — Often expensive and still come with licensing constraints

docprims offers a fourth path: MIT/Apache-2.0 licensed extraction that you can statically link, embed in commercial software, or ship without copyleft obligations.

Who Should Use This

AI/ML Engineers: Building pipelines that ingest documents for training, RAG, or content analysis. You need extraction without license overhead in your data stack.

Platform Teams: Your legal department has opinions about copyleft licenses in your software supply chain. docprims is designed for environments where license hygiene matters.

Document Processing SaaS: Building products that handle customer documents. Embedding a permissively-licensed extractor simplifies your licensing story.

Open Source Projects: Avoiding license compatibility debates. MIT/Apache-2.0 is unambiguous.

Installation

Rust

[dependencies]
docprims-ooxml = "0.1"
docprims-text = "0.1"

Go

go get github.com/3leaps/docprims/bindings/go/docprims

By default, the Go bindings link a vendored static library.

To link the shared library instead (recommended if your application also links another Rust staticlib via cgo):

go build -tags docprims_shared ./...

Runtime search path notes for docprims_shared:

v0.1.4+: Go bindings embed rpath entries for the vendored lib-shared/ directories so local builds can run without extra env vars
For distributing binaries, prefer bundling the shared library next to your executable and embedding an rpath like @executable_path (macOS) / $ORIGIN (Linux)
Linux: set LD_LIBRARY_PATH to include bindings/go/docprims/lib-shared/<platform>
macOS: set DYLD_LIBRARY_PATH to include bindings/go/docprims/lib-shared/<platform>
Windows: add bindings/go/docprims/lib-shared/windows-amd64 to PATH

Note: the Go bindings vendor a Rust staticlib (libdocprims_ffi.a). In some applications that also link another Rust staticlib via cgo, the final link can fail with duplicate Rust runtime symbols (commonly _rust_eh_personality). In that case, use one of:

isolate docprims behind a build tag in the consumer until the link model is resolved
use the docprims CLI as a subprocess
switch one Rust dependency to a shared-library distribution

TypeScript

From a git checkout (v0.1.x). Works with both Node.js and Bun:

cd bindings/typescript/docprims
npm install
npm run build
npm run build:native

From a consumer project using file: protocol:

{
  "dependencies": {
    "@3leaps/docprims": "file:/path/to/docprims/bindings/typescript/docprims"
  }
}

CLI

cargo install docprims-cli

Usage

Rust

use docprims_ooxml::extract_docx;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = extract_docx("document.docx")?;
    println!("{}", text.content);
    Ok(())
}

CLI

# Extract text from a document
docprims extract document.docx

# Output as JSON with metadata
docprims extract document.docx --format json --include-metadata

# Extract from multiple formats
docprims extract report.pdf slides.pptx data.xlsx
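
The CLI can also be driven from another program as a subprocess, which the Go notes above suggest as a workaround for link conflicts. The sketch below builds the same `extract ... --format json --include-metadata` invocation shown above and runs it with `std::process::Command`. The helper names `extract_json_args` and `run_extract` are hypothetical, and the code assumes a `docprims` binary is on `PATH`. The shape of the JSON on stdout is not specified here, so parsing is left to the caller.

```rust
use std::process::{Command, Output};

/// Builds the argument list for
/// `docprims extract <file> --format json --include-metadata`.
fn extract_json_args(file: &str) -> Vec<String> {
    vec![
        "extract".to_string(),
        file.to_string(),
        "--format".to_string(),
        "json".to_string(),
        "--include-metadata".to_string(),
    ]
}

/// Runs the CLI and returns its captured stdout, stderr, and exit status.
/// Assumes a `docprims` binary is on PATH; yields an io::Error if the
/// process cannot be started at all.
fn run_extract(file: &str) -> std::io::Result<Output> {
    Command::new("docprims").args(extract_json_args(file)).output()
}
```

A caller would check `output.status.success()` before reading `output.stdout`, and treat a failed spawn (the `Err` case) separately from a non-zero exit.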

Architecture

docprims/
├── crates/
│   ├── docprims-core/    # Shared types, traits, errors
│   ├── docprims-text/    # Markdown, HTML, XML
│   ├── docprims-ooxml/   # DOCX, XLSX, PPTX
│   └── docprims-cli/     # CLI binary
├── ffi/
│   └── docprims-ffi/     # C-ABI for language bindings
└── bindings/
    ├── go/               # Go binding
    ├── typescript/       # TypeScript binding (Node.js + Bun)
    └── python/           # Python binding
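
To illustrate the role of the `ffi/` layer in this layout, here is a rough sketch of what a C-ABI function pair can look like in Rust. This is not the real docprims-ffi API: `example_extract` and `example_free` are hypothetical names, and trimming whitespace stands in for actual extraction. The pattern it shows (the Rust side allocates a string, the caller hands it back to be freed) is a common one for bindings, assumed here rather than confirmed by the source.

```rust
use std::ffi::{c_char, CStr, CString};

/// Hypothetical export, NOT the real docprims-ffi API. Takes a
/// NUL-terminated UTF-8 string and returns a newly allocated copy with
/// surrounding whitespace trimmed (a stand-in for real extraction).
/// Returns null on null input or invalid UTF-8.
/// A real export would also carry a `no_mangle` attribute so that bindings
/// can find the symbol by name.
pub unsafe extern "C" fn example_extract(input: *const c_char) -> *mut c_char {
    if input.is_null() {
        return std::ptr::null_mut();
    }
    // SAFETY: the caller guarantees `input` points to a valid C string.
    let text = match unsafe { CStr::from_ptr(input) }.to_str() {
        Ok(s) => s.trim(),
        Err(_) => return std::ptr::null_mut(),
    };
    // Ownership of the allocation passes to the caller via `into_raw`.
    match CString::new(text) {
        Ok(out) => out.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}

/// Hypothetical companion: releases a string returned by `example_extract`.
pub unsafe extern "C" fn example_free(ptr: *mut c_char) {
    if !ptr.is_null() {
        // SAFETY: `ptr` must have come from `CString::into_raw` above.
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

Each language binding then only needs to call the exported function and the matching free function, keeping memory ownership rules in one place.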

Development

For project documentation conventions (which documents are canonical and which are planning notes), see docs/orientation/sources-of-truth.md.

Prerequisites

Rust 1.81+
curl (for bootstrap)

Setup

make bootstrap    # Install development tools
make check        # Run all quality checks
make test         # Run tests

Quality Gates

make fmt          # Format code
make lint         # Run clippy
make deny         # Check licenses (GPL-free enforcement)
make audit        # Security vulnerability scan

Supply Chain

docprims is designed for environments where dependency hygiene matters:

License-clean: All dependencies use MIT, Apache-2.0, or compatible licenses
Auditable: Run cargo tree to inspect the full dependency graph
SBOM-ready: Compatible with cargo sbom
No runtime network calls: All functionality is local

# Check dependencies
cargo deny check licenses

# Audit for vulnerabilities
cargo audit
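
As a conceptual illustration of the license-clean policy, the sketch below checks a simplified SPDX expression against an allowlist. It is not how `cargo deny` is implemented: the function `license_allowed` is hypothetical, and it handles only flat `A OR B` and `A AND B` forms, with no parentheses or `WITH` exceptions.

```rust
/// Returns true if a simplified SPDX expression is satisfied by `allowed`.
/// Only flat `A OR B` / `A AND B` forms are handled; parentheses and
/// `WITH` exceptions are out of scope for this sketch.
fn license_allowed(expr: &str, allowed: &[&str]) -> bool {
    // An OR expression passes if any one alternative passes.
    expr.split(" OR ").any(|alt| {
        // An AND group passes only if every term is on the allowlist.
        alt.split(" AND ")
            .all(|term| allowed.contains(&term.trim()))
    })
}
```

With an allowlist of `["MIT", "Apache-2.0"]`, a dual-licensed `MIT OR Apache-2.0` dependency passes, while `MIT AND GPL-3.0` does not, because every term of an AND group must be allowed.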

Prior Art

docprims builds on ideas from others in this space:

poppler — Excellent PDF rendering library (GPL). If copyleft works for your use case, it's battle-tested.
mupdf — High-quality document toolkit (AGPL). Commercial licenses available.
calamine — MIT-licensed Rust XLSX/ODS reader. Good reference for spreadsheet parsing.
pdf-rs — MIT-licensed Rust PDF parsing.

We're not claiming to replace these projects. docprims fills a specific niche: embeddable, license-clean document extraction with first-class bindings.

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT License (LICENSE-MIT)

at your option.

Subject to 3 Leaps OSS policies.

Related Projects

sysprims — Process control primitives with the same licensing philosophy (sibling project)

Contributing

See CONTRIBUTING.md for guidelines.
