feat: add schema validation for LLM extracted fields#212

Closed
utkarshqz wants to merge 2 commits into fireform-core:main from utkarshqz:feat/schema-validation

Conversation

@utkarshqz

feat: add schema validation for LLM extracted fields

Summary

This PR adds field-level schema validation to the LLM extraction pipeline, directly addressing GSoC Expected Outcome #1, which requires "improved AI extraction accuracy through schema validation".

After Mistral extracts values from the transcript, each value is now automatically validated against expected patterns for its field type before being written to the PDF. Validation issues are reported as structured warnings — never as hard failures — ensuring the pipeline remains robust while giving developers visibility into extraction quality.


Closes / Fixes

Closes #114
Addresses #173 — hallucination detection catches repeated values across fields
Addresses #186 — LLM test coverage now at 40 tests (was 0)


Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

What changed and why

1. 🔍 validate_extracted_fields() — new method in src/llm.py

Called automatically inside main_loop() after every extraction. Runs 5 checks:

| Check | What it does | Example |
| --- | --- | --- |
| Phone format | Validates digits, dashes, brackets, spaces | `"not-a-phone"` → warning |
| Email format | Must contain `@` and a domain | `"johndoe"` → warning |
| Date format | Matches `DD/MM/YYYY`, `YYYY-MM-DD`, etc. | `"yesterday"` → warning |
| Hallucination detection | Same value in 3+ fields = likely hallucination | `{"f1": "John", "f2": "John", "f3": "John"}` → warning |
| Length guard | Values over 500 chars flagged | Prevents a hallucinated paragraph filling a name field |

Design decisions:

  • Never raises an exception — all issues are warnings, not failures
  • None values are skipped — no false positives for empty fields
  • Warnings stored on instance — accessible via get_validation_warnings()
  • Runs after both batch extraction AND fallback per-field extraction

Real output example (from local testing):

```
[SCHEMA VALIDATION] All fields passed validation ✓
```

Or when issues are found:

```
[SCHEMA VALIDATION] Issues found:
  [SCHEMA] 'email': value 'johndoe' does not look like a valid email address
  [SCHEMA] Possible hallucination — value 'John Smith' appears in 3 fields: ['f1', 'f2', 'f3']
```
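For reviewers, here is a minimal sketch of what the five checks could look like as a standalone function. The field-name-based dispatch, the specific regexes, and the function being free-standing (rather than a method on the extractor) are assumptions for illustration, not the actual implementation in src/llm.py:

```python
import re
from collections import Counter

def validate_extracted_fields(fields):
    """Run lightweight schema checks on extracted values.

    Returns a list of warning strings; never raises.
    """
    warnings = []
    # None values are skipped up front — no false positives for empty fields
    non_null = {k: v for k, v in fields.items() if v is not None}

    for name, raw in non_null.items():
        value = str(raw)
        if len(value) > 500:
            warnings.append(f"[SCHEMA] '{name}': value exceeds 500 characters")
        if "phone" in name.lower() and not re.fullmatch(r"[\d\s\-\(\)\+]+", value):
            warnings.append(f"[SCHEMA] '{name}': value '{value}' does not look like a phone number")
        if "email" in name.lower() and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
            warnings.append(f"[SCHEMA] '{name}': value '{value}' does not look like a valid email address")
        if "date" in name.lower() and not re.fullmatch(r"\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2}", value):
            warnings.append(f"[SCHEMA] '{name}': value '{value}' is not a recognised date format")

    # Hallucination check: the same value appearing in 3+ fields
    for value, n in Counter(non_null.values()).items():
        if n >= 3:
            culprits = [k for k, v in non_null.items() if v == value]
            warnings.append(
                f"[SCHEMA] Possible hallucination — value '{value}' appears in {n} fields: {culprits}"
            )
    return warnings
```

Returning warnings instead of raising keeps the pipeline robust: a suspicious value still reaches the PDF, but the developer is told about it.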

2. 🧪 5 new unit tests — tests/test_llm.py::TestSchemaValidation

| Test | What it verifies |
| --- | --- |
| `test_valid_fields_return_no_warnings` | Clean extraction → empty warnings list |
| `test_invalid_email_flagged` | Email without `@` → warning produced |
| `test_repeated_values_flagged_as_hallucination` | Same value in 3 fields → hallucination warning |
| `test_null_values_skipped` | `None` values → no false-positive warnings |
| `test_warnings_stored_on_instance` | `get_validation_warnings()` returns correct data |
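As a sketch, three of these cases in pytest style. `LLMExtractor` and its inlined email check are a hypothetical, self-contained stub standing in for the real class in src/llm.py:

```python
import re

class LLMExtractor:
    """Hypothetical stub of the extractor class, for illustration only."""

    def validate_extracted_fields(self, fields):
        warnings = []
        for name, value in fields.items():
            if value is None:
                continue  # skipped — no false positives for empty fields
            if "email" in name and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(value)):
                warnings.append(
                    f"[SCHEMA] '{name}': value '{value}' does not look like a valid email address"
                )
        self._warnings = warnings  # stored on instance
        return warnings

    def get_validation_warnings(self):
        return self._warnings

class TestSchemaValidation:
    def test_invalid_email_flagged(self):
        warnings = LLMExtractor().validate_extracted_fields({"email": "johndoe"})
        assert any("email" in w for w in warnings)

    def test_null_values_skipped(self):
        assert LLMExtractor().validate_extracted_fields({"name": None}) == []

    def test_warnings_stored_on_instance(self):
        e = LLMExtractor()
        e.validate_extracted_fields({"email": "bad"})
        assert e.get_validation_warnings() != []
```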

3. 📚 docs/TESTING.md — updated

  • Test count updated from 52 → 57
  • Added TestSchemaValidation section describing all 5 new test cases
  • Explains what each validation check covers

docs/TESTING.md is the single source of truth for the test suite — updated with every PR that adds tests.


How Has This Been Tested?

```
python -m pytest tests/ -v
57 passed, 14 warnings in 0.35s
```
  • All 57 tests pass locally ✅
  • Schema validation runs on every main_loop() call ✅
  • Verified no false positives on valid data ✅
  • Verified hallucination detection catches repeated values ✅

Test Configuration:

  • OS: Windows 11
  • Python: 3.11.9
  • pytest: 9.0.2

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

    Note: This PR depends on PR #209 (test: replace broken test suite with 52 passing tests) and PR #210 (feat: frontend UI, batch LLM extraction, dynamic PDF labels, API hardening), which are open but not yet merged into main.

@utkarshqz utkarshqz closed this Mar 10, 2026