|
28 | 28 |
|
29 | 29 | Use LLMs to **robustly** extract structured data from HTML and markdown. Used in production by Lightfeed, where it has successfully extracted 10M+ records. Written in TypeScript/Node.js.
30 | 30 |
|
31 | | -## Key Features |
32 | | -✅ **Sanitize and recover imperfect, failed, or partial LLM outputs into valid JSON** - Ensures outputs conform to your schema defined in Zod, especially for complex schemas with deeply nested objects and arrays. See [JSON Sanitization](#json-sanitization) for details. |
| 31 | +## How It Works |
33 | 32 |
|
34 | | -🔗 **Robust URL extraction** - Validates URLs, handles relative/absolute paths, skips invalid URLs and fixes markdown-escaped links automatically. See [URL Validation](#url-validation) section for details. |
| 33 | +1. **HTML to Markdown Conversion**: If the input is HTML, it is first converted to clean, LLM-friendly markdown. This step can optionally extract only the main content (e.g. removing navigation, headers, and footers) and include images. See the [HTML to Markdown Conversion](#html-to-markdown-conversion) section for details. The `convertHtmlToMarkdown` function can also be used standalone; a usage sketch covering steps 1 and 2 follows this list.
35 | 34 |
|
36 | | -## Other Features |
37 | | -- Convert HTML to LLM-ready markdown, with option to extract only the main content from HTML (e.g. removing navigation, headers & footers) and option to include images. The `convertHtmlToMarkdown` function is exposed as a top-level utility that can be used independently without running the full LLM extraction pipeline. See [HTML to Markdown Conversion](#html-to-markdown-conversion) section for details |
38 | | -- Extract structured data using Google Gemini or OpenAI models (Gemini 2.5 flash and GPT-4o mini by default), option to truncate to max input token limit |
39 | | -- Support for custom extraction prompts |
40 | | -- Return token usage per each call |
41 | | -- Extensive unit tests and integration tests to ensure production reliability |
| 35 | +2. **LLM Processing**: The markdown is sent to an LLM (Google Gemini 2.5 Flash or OpenAI GPT-4o mini by default) with a prompt to extract structured data according to your Zod schema; custom extraction prompts are also supported. You can set a maximum input token limit to control cost or avoid exceeding the model's context window, and token usage metrics are returned for each LLM call.
| 36 | + |
| 37 | +3. **JSON Sanitization**: If the LLM output isn't perfect JSON or doesn't fully match your schema, a sanitization process attempts to recover and fix the data. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays. See the [JSON Sanitization](#json-sanitization) section for details, and the conceptual sketch after this list.
| 38 | + |
| 39 | +4. **URL Validation**: All extracted URLs are validated: relative URLs are resolved, invalid URLs are removed, and markdown-escaped links are repaired automatically. See the [URL Validation](#url-validation) section for details and the sketch below.
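Below is a minimal usage sketch covering steps 1 and 2. Only `convertHtmlToMarkdown` is named in this README; the package import path, the `extract` entry point, the option names, and the result fields are illustrative assumptions, so check the sections linked above for the exact API.

```typescript
// Illustrative sketch only: the package path, the `extract` entry point, the
// option names, and the result fields are assumptions, not the confirmed API.
import { z } from "zod";
import { convertHtmlToMarkdown, extract } from "@lightfeed/extract";

async function extractProducts(html: string, sourceUrl: string) {
  // Step 1: HTML -> clean, LLM-friendly markdown (also usable standalone).
  const markdown = convertHtmlToMarkdown(html, {
    extractMainContent: true, // assumed option: drop navigation, headers, footers
    includeImages: false,     // assumed option: leave images out of the markdown
  });

  // Step 2: describe the desired output with a Zod schema, then let the LLM fill it in.
  const schema = z.object({
    products: z.array(
      z.object({
        name: z.string(),
        price: z.number().nullable(),
        url: z.string().url(),
      })
    ),
  });

  const result = await extract({
    content: markdown,
    schema,
    sourceUrl,               // assumed option: base URL for resolving relative links
    maxInputTokens: 128_000, // assumed option: truncate input to control cost
  });

  console.log(result.usage); // per-call token usage metrics (assumed result field)
  return result.data;        // schema-conformant data (assumed result field)
}
```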
42 | 40 |
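For step 3, the sketch below shows the general idea behind sanitization: strip any wrapper text or code fences the model adds, parse the JSON payload that remains, and validate it against the Zod schema. It is a conceptual illustration, not the library's internal implementation, which recovers from far more failure modes.

```typescript
import { z } from "zod";

// Conceptual sketch of sanitization, not the library's internal code.
function sanitizeLLMJson<T>(raw: string, schema: z.ZodType<T>): T {
  // Remove markdown code fences the model may have wrapped around the JSON.
  const stripped = raw.replace(/`{3}(?:json)?/g, "").trim();

  // Keep only the outermost JSON object or array in case the model added prose.
  const start = stripped.search(/[\[{]/);
  const end = Math.max(stripped.lastIndexOf("}"), stripped.lastIndexOf("]"));
  if (start === -1 || end <= start) {
    throw new Error("No JSON payload found in LLM output");
  }

  const parsed: unknown = JSON.parse(stripped.slice(start, end + 1));
  // schema.parse reports a precise path when deeply nested fields are wrong.
  return schema.parse(parsed);
}

// Example:
// sanitizeLLMJson('Sure! {"name":"Acme"} Hope this helps.', z.object({ name: z.string() }))
//   → { name: "Acme" }
```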
|
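For step 4, here is a conceptual sketch of the URL handling: resolve relative links against the page URL, drop anything that still is not a valid URL, and undo markdown escaping. Again, this is illustrative rather than the library's exact code.

```typescript
// Conceptual sketch of URL validation, not the library's exact implementation.
function normalizeUrl(raw: string, baseUrl: string): string | undefined {
  // Undo markdown escaping (e.g. "\_") that sometimes leaks into extracted hrefs.
  const unescaped = raw.replace(/\\([_()\[\]])/g, "$1");
  try {
    // new URL(relative, base) resolves "/pricing" to "https://example.com/pricing".
    return new URL(unescaped, baseUrl).href;
  } catch {
    // Invalid URLs are skipped instead of failing the whole extraction.
    return undefined;
  }
}

// normalizeUrl("/docs/getting\\_started", "https://example.com")
//   → "https://example.com/docs/getting_started"
```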
43 | 41 | ## Why use an LLM extractor? |
44 | 42 | 🔎 Can reason from context, perform search and return structured answers in addition to extracting content as-is |
|