Commit d2d601c

Enrich mode (#8)
1 parent 0b4e6dd commit d2d601c

6 files changed, 349 additions and 42 deletions

README.md

Lines changed: 80 additions & 35 deletions
Original file line number | Diff line number | Diff line change
@@ -26,26 +26,28 @@
2626

2727
# ⚡️ Lightfeed Extract
2828

29-
Use LLMs to **robustly** extract structured data from HTML and markdown. Used in production by Lightfeed and successfully extracting 10M+ records. Written in Typescript/Node.js.
29+
Use LLMs to **robustly** extract or enrich structured data from HTML and markdown. Used in production by Lightfeed and successfully extracting 10M+ records. Written in Typescript/Node.js.
3030

3131
## How It Works
3232

3333
1. **HTML to Markdown Conversion**: If the input is HTML, it's first converted to clean, LLM-friendly markdown. This step can optionally extract only the main content and include images. See [HTML to Markdown Conversion](#html-to-markdown-conversion) section for details. The `convertHtmlToMarkdown` function can also be used standalone.
3434

35-
2. **LLM Processing**: The markdown is sent to an LLM (Google Gemini 2.5 flash or OpenAI GPT-4o mini by default) with a prompt to extract structured data according to your Zod schema. You can set a maximum input token limit to control costs or avoid exceeding the model's context window, and the function will return token usage metrics for each LLM call.
35+
2. **LLM Processing**: The markdown is sent to an LLM (Google Gemini 2.5 flash or OpenAI GPT-4o mini by default) with a prompt to extract structured data according to your Zod schema or enrich existing data objects. You can set a maximum input token limit to control costs or avoid exceeding the model's context window, and the function will return token usage metrics for each LLM call.
3636

37-
3. **JSON Sanitization**: If the LLM output isn't perfect JSON or doesn't fully match your schema, a sanitization process attempts to recover and fix the data. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays. See [JSON Sanitization](#json-sanitization) for details.
37+
3. **JSON Sanitization**: If the LLM structured output fails or doesn't fully match your schema, a sanitization process attempts to recover and fix the data. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays. See [JSON Sanitization](#json-sanitization) for details.
3838

3939
4. **URL Validation**: All extracted URLs are validated - handling relative URLs, removing invalid ones, and repairing markdown-escaped links. See [URL Validation](#url-validation) section for details.
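The URL validation step above can be sketched roughly as follows. This is a minimal illustration and not the library's actual implementation: `normalizeUrl` is a hypothetical helper, and the set of markdown escapes it undoes and the restriction to `http`/`https` are assumptions. It relies only on the standard `URL` constructor to resolve relative links against the source URL and to reject strings that cannot be parsed.

```typescript
// Hypothetical helper for illustration; not part of lightfeed-extract.
function normalizeUrl(rawUrl: string, sourceUrl?: string): string | null {
  // Undo markdown escaping such as "\_" (assumed escape set) so the
  // backslash is not interpreted as part of the URL path.
  const unescaped = rawUrl.trim().replace(/\\([_*()\[\]~-])/g, "$1");
  try {
    // Relative URLs resolve against sourceUrl; without a base they throw.
    const resolved = new URL(unescaped, sourceUrl);
    // Assumption: only web links are kept.
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      return null;
    }
    return resolved.href;
  } catch {
    // Unparseable input is treated as invalid and dropped.
    return null;
  }
}
```

A caller would pass the same `sourceUrl` given to `extract` and discard any field for which the helper returns `null`.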
4040

4141
## Why use an LLM extractor?
42-
🔎 Can reason from context, perform search and return structured answers in addition to extracting content as-is
42+
💡 Can reason from context and return structured answers in addition to extracting content as-is
43+
44+
🔎 Can search from additional context and enrich existing data objects
4345

4446
⚡️ No need to manually create custom scraper code for each site
4547

4648
🔁 Resilient to website changes, e.g., HTML structure, CSS selectors, or page layout
4749

48-
💡 LLMs are becoming more accurate and cost-effective
50+
LLMs are becoming more accurate and cost-effective
4951

5052
## Installation
5153

@@ -59,7 +61,7 @@ While this library provides a robust foundation for data extraction, you might w
5961

6062
- **Persistent Searchable Databases**: Automatically store and manage extracted data in a production-ready vector database
6163
- **Scheduled Runs, Deduplication and Tracking**: Smart detection and handling of duplicate content across your sources, with automated change tracking
62-
- **Pagination and Multi-page Extraction**: Follow links to collect complete data from connected pages
64+
- **Deep Link Extraction**: Follow links to collect complete data from connected pages
6365
- **Real-time API and Integration**: Query your extracted data through robust API endpoints and integrations
6466
- **Research Portal**: Explore and analyze your data through an intuitive interface
6567

@@ -68,8 +70,8 @@ While this library provides a robust foundation for data extraction, you might w
6870
### Basic Example
6971

7072
```typescript
71-
import { extract, ContentFormat, LLMProvider } from 'lightfeed-extract';
72-
import { z } from 'zod';
73+
import { extract, ContentFormat, LLMProvider } from "lightfeed-extract";
74+
import { z } from "zod";
7375

7476
async function main() {
7577
// Define your schema. We will run one more sanitization process to recover imperfect, failed, or partial LLM outputs into this schema
@@ -100,28 +102,28 @@ async function main() {
100102
`,
101103
format: ContentFormat.HTML,
102104
schema,
103-
sourceUrl: 'https://example.com/blog/async-await', // Required for HTML format to handle relative URLs
104-
googleApiKey: 'your-google-api-key'
105+
sourceUrl: "https://example.com/blog/async-await", // Required for HTML format to handle relative URLs
106+
googleApiKey: "your-google-gemini-api-key",
105107
});
106108

107-
console.log('Extracted Data:', result.data);
108-
console.log('Token Usage:', result.usage);
109+
console.log("Extracted Data:", result.data);
110+
console.log("Token Usage:", result.usage);
109111
}
110112

111113
main().catch(console.error);
112114
```
113115

114-
### Extracting from Markdown
116+
### Extracting from Markdown or Plain Text
115117

116118
You can also extract structured data directly from Markdown string:
117119

118120
```typescript
119121
const result = await extract({
120122
content: markdownContent,
123+
// Specify that content is Markdown. In addition to HTML and Markdown, you can also extract from plain text by setting format to ContentFormat.TXT
121124
format: ContentFormat.MARKDOWN,
122125
schema: mySchema,
123-
provider: LLMProvider.OPENAI,
124-
openaiApiKey: 'your-openai-api-key'
126+
googleApiKey: "your-google-gemini-api-key",
125127
});
126128
```
127129

@@ -131,31 +133,73 @@ You can provide a custom prompt to guide the extraction process:
131133

132134
```typescript
133135
const result = await extract({
134-
content: textContent,
135-
format: ContentFormat.TXT,
136+
content: htmlContent,
137+
format: ContentFormat.HTML,
136138
schema: mySchema,
137-
prompt: "Extract only products that are on sale or have special discounts. Include their original prices, discounted prices, and all specifications.",
138-
provider: LLMProvider.GOOGLE_GEMINI,
139-
googleApiKey: 'your-google-api-key'
139+
sourceUrl: "https://example.com/products",
140+
// In the custom prompt, define what data should be retrieved
141+
prompt: "Extract ONLY products that are on sale or have special discounts. Include their original prices, discounted prices, and product URL.",
142+
googleApiKey: "your-google-gemini-api-key",
140143
});
141144
```
142145

143146
If no prompt is provided, a default extraction prompt will be used.
144147

145-
### Customizing Model and Managing Token Limits
148+
### Data Enrichment
149+
150+
You can use the `dataToEnrich` option to provide an existing data object that will be enriched with additional information from the content. This is particularly useful for:
151+
152+
- Updating incomplete records with missing information
153+
- Enhancing existing data with new details from content
154+
- Merging data from multiple sources
155+
156+
The LLM will be instructed to enrich the provided object rather than creating a completely new one:
157+
158+
```typescript
159+
// Example of enriching a product record with missing information
160+
const productToEnrich = {
161+
productUrl: "https://example.com/products/smart-security-camera",
162+
name: "",
163+
price: 0,
164+
reviews: [],
165+
};
166+
167+
const result = await extract({
168+
content: htmlContent,
169+
format: ContentFormat.HTML,
170+
schema: productSchema,
171+
sourceUrl: "https://example.com/products/smart-security-camera",
172+
prompt: "Enrich the product data with complete details from the product page.",
173+
dataToEnrich: productToEnrich,
174+
googleApiKey: "your-google-gemini-api-key",
175+
});
176+
177+
// Result will contain the original data enriched with information from the content
178+
console.log(result.data);
179+
// {
180+
// productUrl: "https://example.com/products/smart-security-camera" // Preserved from original object
181+
// name: "Smart Security Camera", // Enriched from the product page
182+
// price: 74.50, // Enriched from the product page
183+
// reviews: ["I really like this camera", ...] // Reviews enriched from the product page
184+
// }
185+
```
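Because the model is only instructed to enrich the object, it may be worth checking afterwards that fields which already had values were preserved. The sketch below is one way to do that. `findOverwrittenFields` and `isEmptyValue` are hypothetical helpers, not part of lightfeed-extract, and treating `""`, `0`, and `[]` as "empty placeholders" is an assumption taken from the example above.

```typescript
// Hypothetical helpers for illustration; not part of lightfeed-extract.
type DataRecord = Record<string, unknown>;

// Assumption: these values mark a field as "not filled in yet".
function isEmptyValue(value: unknown): boolean {
  return (
    value === undefined ||
    value === null ||
    value === "" ||
    value === 0 ||
    (Array.isArray(value) && value.length === 0)
  );
}

// Returns the keys whose non-empty original value differs in the enriched result.
function findOverwrittenFields(original: DataRecord, enriched: DataRecord): string[] {
  return Object.keys(original).filter((key) => {
    const before = original[key];
    if (isEmptyValue(before)) {
      return false; // placeholders are expected to change
    }
    // Structural comparison; a missing key in `enriched` is also reported.
    return JSON.stringify(before) !== JSON.stringify(enriched[key]);
  });
}
```

With the example above, `findOverwrittenFields(productToEnrich, result.data)` would return an empty array when `productUrl` comes back unchanged.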
186+
187+
### Customizing LLM Provider and Managing Token Limits
146188

147-
You can customize the model and manage token limits to control costs and ensure your content fits within the model's maximum context window:
189+
You can customize the LLM provider and model, and manage token limits to control costs and ensure your content fits within the model's maximum context window:
148190

149191
```typescript
150192
// Extract from Markdown with token limit
151193
const result = await extract({
152194
content: markdownContent,
153195
format: ContentFormat.MARKDOWN,
154196
schema,
197+
// Provide model provider and model name
155198
provider: LLMProvider.OPENAI,
156-
openaiApiKey: 'your-openai-api-key',
157-
modelName: 'gpt-4o',
158-
maxInputTokens: 128000 // Limit to roughly 128K tokens (max input for gpt-4o-mini)
199+
modelName: "gpt-4o-mini",
200+
openaiApiKey: "your-openai-api-key",
201+
// Limit to roughly 128K tokens (max input for gpt-4o-mini)
202+
maxInputTokens: 128000,
159203
});
160204
```
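To make the token limit concrete, here is a rough sketch of the 4-characters-per-token budgeting described for `maxInputTokens`. The function names and the exact trimming strategy are assumptions for illustration and may differ from the library's `truncateContent`. The prompt overhead stands for everything in the prompt other than the content, which would include the serialized `dataToEnrich` object when one is provided.

```typescript
// Rough conversion stated in the README: about 4 characters per token.
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: estimate how many tokens a string will consume.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical helper: trim content so that content plus prompt overhead
// stays within the character budget implied by maxInputTokens.
function truncateToBudget(
  content: string,
  promptOverheadChars: number,
  maxInputTokens: number
): string {
  const maxChars = maxInputTokens * CHARS_PER_TOKEN;
  const available = Math.max(0, maxChars - promptOverheadChars);
  // Content that already fits is returned unchanged.
  return content.length <= available ? content : content.slice(0, available);
}
```

Under this estimate, a limit of 128000 tokens corresponds to a budget of 512000 characters for the whole prompt.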
161205

@@ -171,7 +215,7 @@ const result = await extract({
171215
htmlExtractionOptions: {
172216
extractMainHtml: true // Uses heuristics to remove navigation, headers, footers, etc.
173217
},
174-
sourceUrl: sourceUrl
218+
sourceUrl,
175219
});
176220
```
177221

@@ -234,13 +278,14 @@ Main function to extract structured data from content.
234278
| `schema` | `z.ZodTypeAny` | Zod schema defining the structure to extract | Required |
235279
| `prompt` | `string` | Custom prompt to guide the extraction process | Internal default prompt |
236280
| `provider` | `LLMProvider` | LLM provider (GOOGLE_GEMINI or OPENAI) | `LLMProvider.GOOGLE_GEMINI` |
237-
| `modelName` | `string` | Model name to use | Provider-specific default |
238-
| `googleApiKey` | `string` | Google API key (if using Google Gemini provider) | From env `GOOGLE_API_KEY` |
281+
| `modelName` | `string` | Model name to use | Provider-specific default, Google Gemini 2.5 flash or OpenAI GPT-4o mini |
282+
| `googleApiKey` | `string` | Google Gemini API key (if using Google Gemini provider) | From env `GOOGLE_API_KEY` |
239283
| `openaiApiKey` | `string` | OpenAI API key (if using OpenAI provider) | From env `OPENAI_API_KEY` |
240284
| `temperature` | `number` | Temperature for the LLM (0-1) | `0` |
241285
| `htmlExtractionOptions` | `HTMLExtractionOptions` | HTML-specific options for content extraction (see below) | `{}` |
242286
| `sourceUrl` | `string` | URL of the HTML content, required when format is HTML to properly handle relative URLs | Required for HTML format |
243287
| `maxInputTokens` | `number` | Maximum number of input tokens to send to the LLM. Uses a rough conversion of 4 characters per token. When specified, content will be truncated if the total prompt size exceeds this limit. | `undefined` |
288+
| `dataToEnrich` | `Record<string, any>` | Original data object to enrich with information from the content. When provided, the LLM will be instructed to update this object rather than creating a new one from scratch. | `undefined` |
244289

245290
#### HTML Extraction Options
246291

@@ -288,10 +333,10 @@ The function returns a string containing the markdown conversion of the HTML con
288333
#### Example
289334

290335
```typescript
291-
import { convertHtmlToMarkdown, HTMLExtractionOptions } from 'lightfeed-extract';
336+
import { convertHtmlToMarkdown, HTMLExtractionOptions } from "lightfeed-extract";
292337

293338
// Basic conversion
294-
const markdown = convertHtmlToMarkdown('<h1>Hello World</h1><p>This is a test</p>');
339+
const markdown = convertHtmlToMarkdown("<h1>Hello World</h1><p>This is a test</p>");
295340
console.log(markdown);
296341
// Output: "Hello World\n===========\n\nThis is a test"
297342

@@ -303,9 +348,9 @@ const options: HTMLExtractionOptions = {
303348

304349
// With source URL to handle relative links
305350
const markdownWithOptions = convertHtmlToMarkdown(
306-
'<div><img src="/images/logo.png" alt="Logo"><a href="/about">About</a></div>',
351+
'<div><img src="/images/logo.png" alt="Logo"><a href="/about">About</a></div>',
307352
options,
308-
'https://example.com'
353+
"https://example.com"
309354
);
310355
console.log(markdownWithOptions);
311356
// Output: "![Logo](https://example.com/images/logo.png)[About](https://example.com/about)"
@@ -321,8 +366,8 @@ safeSanitizedParser<T>(schema: ZodTypeAny, rawObject: unknown): z.infer<T> | nul
321366
```
322367

323368
```typescript
324-
import { safeSanitizedParser } from 'lightfeed-extract';
325-
import { z } from 'zod';
369+
import { safeSanitizedParser } from "lightfeed-extract";
370+
import { z } from "zod";
326371

327372
// Define a product catalog schema
328373
const productSchema = z.object({
@@ -360,7 +405,7 @@ const rawLLMOutput = {
360405
},
361406
{
362407
id: 3,
363-
// Missing required 'name' field
408+
// Missing required "name" field
364409
price: 45.99,
365410
inStock: false
366411
},

src/extractors.ts

Lines changed: 38 additions & 5 deletions
@@ -72,6 +72,7 @@ interface ExtractionPromptOptions {
7272
format: string;
7373
content: string;
7474
customPrompt?: string;
75+
dataToEnrich?: Record<string, any>;
7576
}
7677

7778
interface TruncateContentOptions extends ExtractionPromptOptions {
@@ -85,20 +86,44 @@ export function generateExtractionPrompt({
8586
format,
8687
content,
8788
customPrompt,
89+
dataToEnrich,
8890
}: ExtractionPromptOptions): string {
8991
// Base prompt structure that's shared between default and custom prompts
9092
const extractionTask = customPrompt
9193
? `${customPrompt}`
9294
: "Please extract structured information from the provided context.";
9395

94-
return `Context information is below:
96+
// If dataToEnrich is provided, include it in the prompt for enrichment
97+
let promptTemplate = `Context information is below:
9598
------
9699
Format: ${format}
97100
---
98101
${content}
99102
------
100103
101-
You are a data extraction assistant that extracts structured information from the above context.
104+
`;
105+
106+
if (dataToEnrich) {
107+
promptTemplate += `Format: JSON
108+
---
109+
${JSON.stringify(dataToEnrich, null, 2)}
110+
------
111+
112+
You are a data extraction assistant that extracts structured information from the above context in ${format} and JSON.
113+
114+
Your task is: ${extractionTask}
115+
116+
## Guidelines:
117+
1. Extract ONLY information explicitly stated in the context
118+
2. Enrich the original JSON object with information from the context
119+
3. Fill additional fields based on relevant information in the context
120+
4. Do not make assumptions or infer missing data
121+
5. Leave fields empty when information is not present or you are uncertain
122+
6. Do not include information that appears incomplete or truncated
123+
124+
`;
125+
} else {
126+
promptTemplate += `You are a data extraction assistant that extracts structured information from the above context.
102127
103128
Your task is: ${extractionTask}
104129
@@ -109,9 +134,12 @@ Your task is: ${extractionTask}
109134
4. Do not include information that appears incomplete or truncated
110135
5. Follow the required schema exactly
111136
112-
Return only the structured data in valid JSON format and nothing else.
113-
114137
`;
138+
}
139+
140+
promptTemplate += `Return only the structured data in valid JSON format and nothing else.`;
141+
142+
return promptTemplate;
115143
}
116144

117145
/**
@@ -122,6 +150,7 @@ export function truncateContent({
122150
format,
123151
content,
124152
customPrompt,
153+
dataToEnrich,
125154
maxTokens,
126155
}: TruncateContentOptions): string {
127156
const maxChars = maxTokens * 4;
@@ -131,6 +160,7 @@ export function truncateContent({
131160
format,
132161
content,
133162
customPrompt,
163+
dataToEnrich,
134164
});
135165

136166
// If the full prompt is within limits, return original content
@@ -157,7 +187,8 @@ export async function extractWithLLM<T extends z.ZodTypeAny>(
157187
temperature: number = 0,
158188
customPrompt?: string,
159189
format: string = ContentFormat.MARKDOWN,
160-
maxInputTokens?: number
190+
maxInputTokens?: number,
191+
dataToEnrich?: Record<string, any>
161192
): Promise<{ data: z.infer<T>; usage: Usage }> {
162193
const llm = createLLM(provider, modelName, apiKey, temperature);
163194
let usage: Usage = {};
@@ -168,6 +199,7 @@ export async function extractWithLLM<T extends z.ZodTypeAny>(
168199
format,
169200
content,
170201
customPrompt,
202+
dataToEnrich,
171203
maxTokens: maxInputTokens,
172204
})
173205
: content;
@@ -177,6 +209,7 @@ export async function extractWithLLM<T extends z.ZodTypeAny>(
177209
format,
178210
content: truncatedContent,
179211
customPrompt,
212+
dataToEnrich,
180213
});
181214

182215
try {

src/index.ts

Lines changed: 15 additions & 1 deletion
@@ -19,6 +19,19 @@ const DEFAULT_MODELS = {
1919
* Extract structured data from HTML, markdown, or plain text content using an LLM
2020
*
2121
* @param options Configuration options for extraction
22+
* @param options.content HTML, markdown, or plain text content to extract from
23+
* @param options.format Content format (HTML, MARKDOWN, or TXT)
24+
* @param options.schema Zod schema defining the structure to extract
25+
* @param options.provider LLM provider (GOOGLE_GEMINI or OPENAI)
26+
* @param options.modelName Model name to use (provider-specific)
27+
* @param options.googleApiKey Google API key (if using Google Gemini provider)
28+
* @param options.openaiApiKey OpenAI API key (if using OpenAI provider)
29+
* @param options.temperature Temperature for the LLM (0-1)
30+
* @param options.prompt Custom prompt to guide the extraction process
31+
* @param options.sourceUrl URL of the HTML content (required for HTML format)
32+
* @param options.htmlExtractionOptions HTML-specific options for content extraction
33+
* @param options.maxInputTokens Maximum number of input tokens to send to the LLM
34+
* @param options.dataToEnrich Original data object to enrich with information from the content
2235
* @returns The extracted data, original content, and usage statistics
2336
*/
2437
export async function extract<T extends z.ZodTypeAny>(
@@ -80,7 +93,8 @@ export async function extract<T extends z.ZodTypeAny>(
8093
options.temperature ?? 0,
8194
options.prompt,
8295
formatToUse.toString(), // Pass the correct format based on actual content
83-
options.maxInputTokens
96+
options.maxInputTokens,
97+
options.dataToEnrich
8498
);
8599

86100
// Return the full result

src/types.ts

Lines changed: 3 additions & 0 deletions
@@ -80,6 +80,9 @@ export interface ExtractorOptions<T extends z.ZodTypeAny> {
8080

8181
/** Maximum number of input tokens to send to the LLM. Uses a rough conversion of 4 characters per token. */
8282
maxInputTokens?: number;
83+
84+
/** Original data object to enrich with extracted information. When provided, the LLM will be instructed to enrich this object with additional information from the content. */
85+
dataToEnrich?: Record<string, any>;
8386
}
8487

8588
/**
