
#170 - PII masking #172

Open
Cubix33 wants to merge 2 commits into fireform-core:main from Cubix33:privacy-tokenization

Conversation


@Cubix33 Cubix33 commented Mar 3, 2026

Closes #170

Description

This PR introduces a PrivacyManager middleware that protects personally identifiable information (PII) during the LLM extraction process.

Technical Changes

  • Added src/privacy.py: Implements a regex-based tokenization engine to swap sensitive data (emails, phone numbers) with cryptographic tokens (e.g., TOKEN_EMAIL_X) before sending text to Ollama. Includes a detokenization method to restore original values.
  • Updated src/file_manipulator.py: Intercepts the extraction pipeline to apply data masking before llm.main_loop() executes, and unmasks the JSON output immediately after. The LLM now only interacts with sanitized text.
  • Refactored src/filler.py: Replaced fragile i += 1 visual-order filling with key-based matching. The filler now matches the LLM's JSON keys directly to the PDF Widget titles (annot.T), preventing data shifting and incorrect field population.
  • Updated src/llm.py: Hardened the system prompt to enforce strict, single-value extraction.
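A minimal sketch of the tokenization/detokenization idea behind src/privacy.py. The class name matches the PR, but the regex patterns, the `_vault` mapping, and the method names here are illustrative assumptions, not the actual implementation:

```python
import re

class PrivacyManager:
    """Illustrative sketch: swap PII for opaque tokens, then restore it.

    Patterns and method names are assumptions for this example; the real
    module may detect more PII categories and generate tokens differently.
    """

    PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def __init__(self):
        self._vault = {}  # token -> original value

    def tokenize(self, text: str) -> str:
        """Replace each PII match with a placeholder like TOKEN_EMAIL_0."""
        for kind, pattern in self.PATTERNS.items():
            def swap(match, kind=kind):
                token = f"TOKEN_{kind}_{len(self._vault)}"
                self._vault[token] = match.group(0)
                return token
            text = pattern.sub(swap, text)
        return text

    def detokenize(self, text: str) -> str:
        """Restore every stored original value in the LLM's output."""
        for token, original in self._vault.items():
            text = text.replace(token, original)
        return text
```

With this shape, the vault never leaves the process: only the placeholder tokens are sent to Ollama, and the originals are re-substituted after extraction.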

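The key-based matching that replaced the positional `i += 1` filling can be sketched as below. The widget objects here are stand-ins with a `.T` title attribute, mirroring the `annot.T` access mentioned for src/filler.py; the function name and the parenthesis-stripping detail are assumptions for this example:

```python
def fill_by_key(widgets, extracted: dict) -> dict:
    """Match each PDF widget's title to an LLM JSON key.

    Hypothetical sketch: returns {field_title: value} for widgets whose
    title appears in the extracted dict, skipping unmatched fields instead
    of shifting values into them positionally.
    """
    filled = {}
    for annot in widgets:
        # pdfrw-style field titles are often wrapped in parentheses
        title = (annot.T or "").strip("()")
        if title in extracted:
            filled[title] = extracted[title]
    return filled
```

Because unmatched widgets are simply skipped, a field the LLM did not extract no longer pulls the next value out of position.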
Verification

  • Verified tokenization successfully masks PII in the LLM prompt payload.
  • Verified detokenization accurately restores original data to the extracted dictionary.
  • Verified key-based filling correctly populates the PDF without positional shifting.
  • Tested locally (Docker/CLI).
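The mask → extract → unmask flow verified above can be sketched end to end. `run_llm_extraction` stands in for `llm.main_loop()`, and the JSON round-trip used to detokenize every extracted value is an assumption about how file_manipulator.py applies the unmasking:

```python
import json

def extract_with_privacy(raw_text, privacy, run_llm_extraction):
    """Hypothetical glue: the LLM only ever sees sanitized text."""
    masked_text = privacy.tokenize(raw_text)       # PII out before the call
    llm_json = run_llm_extraction(masked_text)     # model sees tokens only
    restored = privacy.detokenize(json.dumps(llm_json))
    return json.loads(restored)                    # PII restored afterwards
```

Serializing the extracted dict to a string lets a single `detokenize` pass restore tokens wherever they appear in the output, at the cost of assuming token strings never collide with legitimate JSON content.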



Development

Successfully merging this pull request may close these issues.

[FEAT]: Cryptographic PII Masking & Tokenization
