feat: add schema validation for LLM extracted fields#212

Closed
utkarshqz wants to merge 2 commits into fireform-core:main from utkarshqz:feat/schema-validation

Conversation

@utkarshqz

feat: add schema validation for LLM extracted fields

Summary

This PR adds field-level schema validation to the LLM extraction pipeline, directly addressing GSoC Expected Outcome #1, which requires "improved AI extraction accuracy through schema validation".

After Mistral extracts values from the transcript, each value is now automatically validated against expected patterns for its field type before being written to the PDF. Validation issues are reported as structured warnings — never as hard failures — ensuring the pipeline remains robust while giving developers visibility into extraction quality.


Closes / Fixes

Closes #114
Addresses #173 — hallucination detection catches repeated values across fields
Addresses #186 — LLM test coverage now at 40 tests (was 0)


Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

What changed and why

1. 🔍 validate_extracted_fields() — new method in src/llm.py

Called automatically inside main_loop() after every extraction. Runs 5 checks:

| Check | What it does | Example |
| --- | --- | --- |
| Phone format | Validates digits, dashes, brackets, spaces | `"not-a-phone"` → warning |
| Email format | Must contain `@` and a domain | `"johndoe"` → warning |
| Date format | Matches `DD/MM/YYYY`, `YYYY-MM-DD`, etc. | `"yesterday"` → warning |
| Hallucination detection | Same value in 3+ fields = likely hallucination | `{"f1": "John", "f2": "John", "f3": "John"}` → warning |
| Length guard | Values over 500 chars flagged | Prevents a hallucinated paragraph filling a name field |

Design decisions:

  • Never raises an exception — all issues are warnings, not failures
  • None values are skipped — no false positives for empty fields
  • Warnings stored on instance — accessible via get_validation_warnings()
  • Runs after both batch extraction AND fallback per-field extraction

Real output example (from local testing):

```
[SCHEMA VALIDATION] All fields passed validation ✓
```

Or when issues are found:

```
[SCHEMA VALIDATION] Issues found:
  [SCHEMA] 'email': value 'johndoe' does not look like a valid email address
  [SCHEMA] Possible hallucination — value 'John Smith' appears in 3 fields: ['f1', 'f2', 'f3']
```
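For reviewers, here is a minimal sketch of what the five checks could look like as a standalone function. The field-name-based dispatch, the specific regexes, and the function being free-standing (rather than a method on the extractor) are assumptions for illustration, not the actual implementation in src/llm.py:

```python
import re
from collections import Counter

def validate_extracted_fields(fields):
    """Run lightweight schema checks on extracted values.

    Returns a list of warning strings; never raises.
    """
    warnings = []
    # None values are skipped up front — no false positives for empty fields
    non_null = {k: v for k, v in fields.items() if v is not None}

    for name, raw in non_null.items():
        value = str(raw)
        if len(value) > 500:
            warnings.append(f"[SCHEMA] '{name}': value exceeds 500 characters")
        if "phone" in name.lower() and not re.fullmatch(r"[\d\s\-\(\)\+]+", value):
            warnings.append(f"[SCHEMA] '{name}': value '{value}' does not look like a phone number")
        if "email" in name.lower() and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
            warnings.append(f"[SCHEMA] '{name}': value '{value}' does not look like a valid email address")
        if "date" in name.lower() and not re.fullmatch(r"\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2}", value):
            warnings.append(f"[SCHEMA] '{name}': value '{value}' is not a recognised date format")

    # Hallucination check: the same value appearing in 3+ fields
    for value, n in Counter(non_null.values()).items():
        if n >= 3:
            culprits = [k for k, v in non_null.items() if v == value]
            warnings.append(
                f"[SCHEMA] Possible hallucination — value '{value}' appears in {n} fields: {culprits}"
            )
    return warnings
```

Returning warnings instead of raising keeps the pipeline robust: a suspicious value still reaches the PDF, but the developer is told about it.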

2. 🧪 5 new unit tests — tests/test_llm.py::TestSchemaValidation

| Test | What it verifies |
| --- | --- |
| `test_valid_fields_return_no_warnings` | Clean extraction → empty warnings list |
| `test_invalid_email_flagged` | Email without `@` → warning produced |
| `test_repeated_values_flagged_as_hallucination` | Same value in 3 fields → hallucination warning |
| `test_null_values_skipped` | `None` values → no false-positive warnings |
| `test_warnings_stored_on_instance` | `get_validation_warnings()` returns correct data |
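As a sketch, three of these cases in pytest style. `LLMExtractor` and its inlined email check are a hypothetical, self-contained stub standing in for the real class in src/llm.py:

```python
import re

class LLMExtractor:
    """Hypothetical stub of the extractor class, for illustration only."""

    def validate_extracted_fields(self, fields):
        warnings = []
        for name, value in fields.items():
            if value is None:
                continue  # skipped — no false positives for empty fields
            if "email" in name and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(value)):
                warnings.append(
                    f"[SCHEMA] '{name}': value '{value}' does not look like a valid email address"
                )
        self._warnings = warnings  # stored on instance
        return warnings

    def get_validation_warnings(self):
        return self._warnings

class TestSchemaValidation:
    def test_invalid_email_flagged(self):
        warnings = LLMExtractor().validate_extracted_fields({"email": "johndoe"})
        assert any("email" in w for w in warnings)

    def test_null_values_skipped(self):
        assert LLMExtractor().validate_extracted_fields({"name": None}) == []

    def test_warnings_stored_on_instance(self):
        e = LLMExtractor()
        e.validate_extracted_fields({"email": "bad"})
        assert e.get_validation_warnings() != []
```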

3. 📚 docs/TESTING.md — updated

  • Test count updated from 52 → 57
  • Added TestSchemaValidation section describing all 5 new test cases
  • Explains what each validation check covers

docs/TESTING.md is the single source of truth for the test suite — updated with every PR that adds tests.


How Has This Been Tested?

```
python -m pytest tests/ -v
57 passed, 14 warnings in 0.35s
```
  • All 57 tests pass locally ✅
  • Schema validation runs on every main_loop() call ✅
  • Verified no false positives on valid data ✅
  • Verified hallucination detection catches repeated values ✅

Test Configuration:

  • OS: Windows 11
  • Python: 3.11.9
  • pytest: 9.0.2

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

    Note: This PR depends on PR #209 (test: replace broken test suite with 52 passing tests) and PR #210 (feat: frontend UI, batch LLM extraction, dynamic PDF labels, API hardening), which are open but not yet merged into main.

@utkarshqz utkarshqz closed this Mar 10, 2026