
#170 - PII masking #172

Open
Cubix33 wants to merge 2 commits into fireform-core:main from Cubix33:privacy-tokenization

Conversation


@Cubix33 Cubix33 commented Mar 3, 2026

Closes #170

Description

This PR introduces a PrivacyManager middleware that protects personally identifiable information (PII) during the LLM extraction process.

Technical Changes

  • Added src/privacy.py: Implements a regex-based tokenization engine to swap sensitive data (emails, phone numbers) with cryptographic tokens (e.g., TOKEN_EMAIL_X) before sending text to Ollama. Includes a detokenization method to restore original values.
  • Updated src/file_manipulator.py: Intercepts the extraction pipeline to apply data masking before llm.main_loop() executes, and unmasks the JSON output immediately after. The LLM now only interacts with sanitized text.
  • Refactored src/filler.py: Replaced fragile i += 1 visual-order filling with key-based matching. The filler now matches the LLM's JSON keys directly to the PDF Widget titles (annot.T), preventing data shifting and incorrect field population.
  • Updated src/llm.py: Hardened the system prompt to enforce strict, single-value extraction.
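A minimal sketch of the tokenization/detokenization idea behind src/privacy.py. The class name matches the PR, but the regex patterns, the `_vault` mapping, and the method names here are illustrative assumptions, not the actual implementation:

```python
import re

class PrivacyManager:
    """Illustrative sketch: swap PII for opaque tokens, then restore it.

    Patterns and method names are assumptions for this example; the real
    module may detect more PII categories and generate tokens differently.
    """

    PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def __init__(self):
        self._vault = {}  # token -> original value

    def tokenize(self, text: str) -> str:
        """Replace each PII match with a placeholder like TOKEN_EMAIL_0."""
        for kind, pattern in self.PATTERNS.items():
            def swap(match, kind=kind):
                token = f"TOKEN_{kind}_{len(self._vault)}"
                self._vault[token] = match.group(0)
                return token
            text = pattern.sub(swap, text)
        return text

    def detokenize(self, text: str) -> str:
        """Restore every stored original value in the LLM's output."""
        for token, original in self._vault.items():
            text = text.replace(token, original)
        return text
```

With this shape, the vault never leaves the process: only the placeholder tokens are sent to Ollama, and the originals are re-substituted after extraction.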

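The key-based matching that replaced the positional `i += 1` filling can be sketched as below. The widget objects here are stand-ins with a `.T` title attribute, mirroring the `annot.T` access mentioned for src/filler.py; the function name and the parenthesis-stripping detail are assumptions for this example:

```python
def fill_by_key(widgets, extracted: dict) -> dict:
    """Match each PDF widget's title to an LLM JSON key.

    Hypothetical sketch: returns {field_title: value} for widgets whose
    title appears in the extracted dict, skipping unmatched fields instead
    of shifting values into them positionally.
    """
    filled = {}
    for annot in widgets:
        # pdfrw-style field titles are often wrapped in parentheses
        title = (annot.T or "").strip("()")
        if title in extracted:
            filled[title] = extracted[title]
    return filled
```

Because unmatched widgets are simply skipped, a field the LLM did not extract no longer pulls the next value out of position.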
Verification

  • Verified tokenization successfully masks PII in the LLM prompt payload.
  • Verified detokenization accurately restores original data to the extracted dictionary.
  • Verified key-based filling correctly populates the PDF without positional shifting.
  • Tested locally (Docker/CLI).
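The mask → extract → unmask flow verified above can be sketched end to end. `run_llm_extraction` stands in for `llm.main_loop()`, and the JSON round-trip used to detokenize every extracted value is an assumption about how file_manipulator.py applies the unmasking:

```python
import json

def extract_with_privacy(raw_text, privacy, run_llm_extraction):
    """Hypothetical glue: the LLM only ever sees sanitized text."""
    masked_text = privacy.tokenize(raw_text)       # PII out before the call
    llm_json = run_llm_extraction(masked_text)     # model sees tokens only
    restored = privacy.detokenize(json.dumps(llm_json))
    return json.loads(restored)                    # PII restored afterwards
```

Serializing the extracted dict to a string lets a single `detokenize` pass restore tokens wherever they appear in the output, at the cost of assuming token strings never collide with legitimate JSON content.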



Development

Successfully merging this pull request may close these issues.

[FEAT]: Cryptographic PII Masking & Tokenization
