# ⚡️ Lightfeed Extract

Use LLMs to **robustly** extract or enrich structured data from HTML and markdown. Used in production by Lightfeed, successfully extracting 10M+ records. Written in TypeScript/Node.js.

## How It Works

1. **HTML to Markdown Conversion**: If the input is HTML, it's first converted to clean, LLM-friendly markdown. This step can optionally extract only the main content and include images. See the [HTML to Markdown Conversion](#html-to-markdown-conversion) section for details. The `convertHtmlToMarkdown` function can also be used standalone.
2. **LLM Processing**: The markdown is sent to an LLM (Google Gemini 2.5 Flash or OpenAI GPT-4o mini by default) with a prompt to extract structured data according to your Zod schema or enrich existing data objects. You can set a maximum input token limit to control costs or avoid exceeding the model's context window, and the function will return token usage metrics for each LLM call.
3. **JSON Sanitization**: If the LLM's structured output fails or doesn't fully match your schema, a sanitization process attempts to recover and fix the data. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays. See [JSON Sanitization](#json-sanitization) for details.
4. **URL Validation**: All extracted URLs are validated: relative URLs are resolved, invalid ones are removed, and markdown-escaped links are repaired. See the [URL Validation](#url-validation) section for details.
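The relative-URL handling described in step 4 can be sketched roughly like this (an illustration only, not the library's internal code; the function name is hypothetical):

```typescript
// Resolve an extracted href against the page's sourceUrl; drop URLs that
// cannot be parsed. Hypothetical sketch, not Lightfeed's actual validator.
function resolveExtractedUrl(href: string, sourceUrl: string): string | null {
  try {
    // new URL() handles both absolute and relative hrefs
    return new URL(href, sourceUrl).toString();
  } catch {
    return null; // invalid URL: the library would remove it
  }
}
```

For example, `resolveExtractedUrl("/blog/async-await", "https://example.com/blog/")` resolves to `"https://example.com/blog/async-await"`.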
## Why use an LLM extractor?

💡 Can reason from context and return structured answers in addition to extracting content as-is

🔎 Can search from additional context and enrich existing data objects

⚡️ No need to manually create custom scraper code for each site

🔁 Resilient to website changes, e.g., HTML structure, CSS selectors, or page layout

✅ LLMs are becoming more accurate and cost-effective

## Installation

While this library provides a robust foundation for data extraction, you might w…

- **Persistent Searchable Databases**: Automatically store and manage extracted data in a production-ready vector database
- **Scheduled Runs, Deduplication and Tracking**: Smart detection and handling of duplicate content across your sources, with automated change tracking
- **Deep Link Extraction**: Follow links to collect complete data from connected pages
- **Real-time API and Integration**: Query your extracted data through robust API endpoints and integrations
- **Research Portal**: Explore and analyze your data through an intuitive interface

```typescript
// Define your schema. We will run one more sanitization process to recover
// imperfect, failed, or partial LLM outputs into this schema

async function main() {
  const result = await extract({
    content: `
      ...
    `,
    format: ContentFormat.HTML,
    schema,
    sourceUrl: "https://example.com/blog/async-await", // Required for HTML format to handle relative URLs
    googleApiKey: "your-google-gemini-api-key",
  });

  console.log("Extracted Data:", result.data);
  console.log("Token Usage:", result.usage);
}

main().catch(console.error);
```

### Extracting from Markdown or Plain Text

You can also extract structured data directly from a Markdown string:

```typescript
const result = await extract({
  content: markdownContent,
  // Specify that the content is Markdown. In addition to HTML and Markdown, you can also extract plain text with ContentFormat.TXT
  format: ContentFormat.MARKDOWN,
  schema: mySchema,
  googleApiKey: "your-google-gemini-api-key",
});
```

You can provide a custom prompt to guide the extraction process:

```typescript
const result = await extract({
  content: htmlContent,
  format: ContentFormat.HTML,
  schema: mySchema,
  sourceUrl: "https://example.com/products",
  // In the custom prompt, define what data should be retrieved
  prompt: "Extract ONLY products that are on sale or have special discounts. Include their original prices, discounted prices, and product URL.",
  googleApiKey: "your-google-gemini-api-key",
});
```

If no prompt is provided, a default extraction prompt will be used.

### Data Enrichment

You can use the `dataToEnrich` option to provide an existing data object that will be enriched with additional information from the content. This is particularly useful for:

- Updating incomplete records with missing information
- Enhancing existing data with new details from the content
- Merging data from multiple sources

The LLM will be instructed to enrich the provided object rather than create a completely new one:

```typescript
// Example of enriching a product record with missing information
const productToEnrich = {
  productUrl: "https://example.com/products/smart-security-camera",
};

const result = await extract({
  // ...
  prompt: "Enrich the product data with complete details from the product page.",
  dataToEnrich: productToEnrich,
  googleApiKey: "your-google-gemini-api-key",
});

// Result will contain the original data enriched with information from the content
console.log(result.data);
// {
//   productUrl: "https://example.com/products/smart-security-camera", // Preserved from original object
//   name: "Smart Security Camera", // Enriched from the product page
//   price: 74.50, // Enriched from the product page
//   reviews: ["I really like this camera", ...] // Reviews enriched from the product page
// }
```

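Conceptually, the enriched result behaves like a merge in which fields already present on the original object are preserved and missing fields are filled in from the page. The sketch below is an analogy only: in the library the LLM produces the merged object directly, and this helper is hypothetical.

```typescript
// Hypothetical illustration of the enrichment behavior: values already on
// the original object are preserved, missing ones come from extraction.
// In the library itself, the LLM performs this merge.
function sketchEnrich(
  original: Record<string, unknown>,
  extracted: Record<string, unknown>
): Record<string, unknown> {
  return { ...extracted, ...original };
}
```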
### Customizing LLM Provider and Managing Token Limits

You can customize the LLM provider and model, and manage token limits to control costs and ensure your content fits within the model's maximum context window:

```typescript
// Extract from Markdown with a token limit
const result = await extract({
  content: markdownContent,
  format: ContentFormat.MARKDOWN,
  schema,
  // Provide the model provider and model name
  provider: LLMProvider.OPENAI,
  modelName: "gpt-4o-mini",
  openaiApiKey: "your-openai-api-key",
  // Limit to roughly 128K tokens (max input for gpt-4o-mini)
  maxInputTokens: 128000,
});
```
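The `maxInputTokens` option above relies on a rough 4-characters-per-token estimate. A simplified sketch of that truncation heuristic (an assumption for illustration, not the library's exact implementation):

```typescript
// Truncate content so that the estimated token count (about 4 characters
// per token) stays within maxInputTokens. Simplified illustration only;
// the library applies the limit to the total prompt size.
function truncateForTokenLimit(content: string, maxInputTokens: number): string {
  const maxChars = maxInputTokens * 4;
  return content.length > maxChars ? content.slice(0, maxChars) : content;
}
```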

```typescript
const result = await extract({
  // ...
  htmlExtractionOptions: {
    extractMainHtml: true, // Uses heuristics to remove navigation, headers, footers, etc.
  },
  sourceUrl,
});
```

Main function to extract structured data from content.

| Option | Type | Description | Default |
|--------|------|-------------|---------|
| `schema` | `z.ZodTypeAny` | Zod schema defining the structure to extract | Required |
| `prompt` | `string` | Custom prompt to guide the extraction process | Internal default prompt |
| `provider` | `LLMProvider` | LLM provider (`GOOGLE_GEMINI` or `OPENAI`) | `LLMProvider.GOOGLE_GEMINI` |
| `modelName` | `string` | Model name to use | Provider-specific default: Google Gemini 2.5 Flash or OpenAI GPT-4o mini |
| `googleApiKey` | `string` | Google Gemini API key (if using the Google Gemini provider) | From env `GOOGLE_API_KEY` |
| `openaiApiKey` | `string` | OpenAI API key (if using the OpenAI provider) | From env `OPENAI_API_KEY` |
| `temperature` | `number` | Temperature for the LLM (0-1) | `0` |
| `htmlExtractionOptions` | `HTMLExtractionOptions` | HTML-specific options for content extraction (see below) | `{}` |
| `sourceUrl` | `string` | URL of the HTML content; required when the format is HTML to properly handle relative URLs | Required for HTML format |
| `maxInputTokens` | `number` | Maximum number of input tokens to send to the LLM. Uses a rough conversion of 4 characters per token. When specified, content will be truncated if the total prompt size exceeds this limit. | `undefined` |
| `dataToEnrich` | `Record<string, any>` | Original data object to enrich with information from the content. When provided, the LLM will be instructed to update this object rather than create a new one from scratch. | `undefined` |

#### HTML Extraction Options

The function returns a string containing the markdown conversion of the HTML content.

```typescript
/** Maximum number of input tokens to send to the LLM. Uses a rough conversion of 4 characters per token. */
maxInputTokens?: number;

/** Original data object to enrich with extracted information. When provided, the LLM will be instructed to enrich this object with additional information from the content. */
dataToEnrich?: Record<string, any>;
```