Enrich mode (#8)

azhong-git · web-flow · commit d2d601c1691b · 2025-05-12T19:13:06.000-07:00
diff --git a/README.md b/README.md
@@ -26,26 +26,28 @@
 
 # ⚡️ Lightfeed Extract
 
-Use LLMs to **robustly** extract structured data from HTML and markdown. Used in production by Lightfeed and successfully extracting 10M+ records. Written in Typescript/Node.js.
+Use LLMs to **robustly** extract or enrich structured data from HTML and markdown. Used in production by Lightfeed and successfully extracting 10M+ records. Written in Typescript/Node.js.
 
 ## How It Works
 
 1. **HTML to Markdown Conversion**: If the input is HTML, it's first converted to clean, LLM-friendly markdown. This step can optionally extract only the main content and include images. See [HTML to Markdown Conversion](#html-to-markdown-conversion) section for details. The `convertHtmlToMarkdown` function can also be used standalone.
 
-2. **LLM Processing**: The markdown is sent to an LLM (Google Gemini 2.5 flash or OpenAI GPT-4o mini by default) with a prompt to extract structured data according to your Zod schema. You can set a maximum input token limit to control costs or avoid exceeding the model's context window, and the function will return token usage metrics for each LLM call.
+2. **LLM Processing**: The markdown is sent to an LLM (Google Gemini 2.5 flash or OpenAI GPT-4o mini by default) with a prompt to extract structured data according to your Zod schema or enrich existing data objects. You can set a maximum input token limit to control costs or avoid exceeding the model's context window, and the function will return token usage metrics for each LLM call.
 
-3. **JSON Sanitization**: If the LLM output isn't perfect JSON or doesn't fully match your schema, a sanitization process attempts to recover and fix the data. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays. See [JSON Sanitization](#json-sanitization) for details.
+3. **JSON Sanitization**: If the LLM structured output fails or doesn't fully match your schema, a sanitization process attempts to recover and fix the data. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays. See [JSON Sanitization](#json-sanitization) for details.
 
 4. **URL Validation**: All extracted URLs are validated - handling relative URLs, removing invalid ones, and repairing markdown-escaped links. See [URL Validation](#url-validation) section for details.
 
 ## Why use an LLM extractor?
-🔎 Can reason from context, perform search and return structured answers in addition to extracting content as-is
+💡 Can reason from context and return structured answers in addition to extracting content as-is
+
+🔎 Can search from additional context and enrich existing data objects
 
 ⚡️ No need to manually create custom scraper code for each site
 
 🔁 Resilient to website changes, e.g., HTML structure, CSS selectors, or page layout
 
-💡 LLMs are becoming more accurate and cost-effective
+✅ LLMs are becoming more accurate and cost-effective
 
 ## Installation
 
@@ -59,7 +61,7 @@ While this library provides a robust foundation for data extraction, you might w
 
 - **Persistent Searchable Databases**: Automatically store and manage extracted data in a production-ready vector database
 - **Scheduled Runs, Deduplication and Tracking**: Smart detection and handling of duplicate content across your sources, with automated change tracking
-- **Pagination and Multi-page Extraction**: Follow links to collect complete data from connected pages
+- **Deep Link Extraction**: Follow links to collect complete data from connected pages
 - **Real-time API and Integration**: Query your extracted data through robust API endpoints and integrations
 - **Research Portal**: Explore and analyze your data through an intuitive interface
 
@@ -68,8 +70,8 @@ While this library provides a robust foundation for data extraction, you might w
 ### Basic Example
 
 ```typescript
-import { extract, ContentFormat, LLMProvider } from 'lightfeed-extract';
-import { z } from 'zod';
+import { extract, ContentFormat, LLMProvider } from "lightfeed-extract";
+import { z } from "zod";
 
 async function main() {
   // Define your schema. We will run one more sanitization process to recover imperfect, failed, or partial LLM outputs into this schema
@@ -100,28 +102,28 @@ async function main() {
     `,
     format: ContentFormat.HTML,
     schema,
-    sourceUrl: 'https://example.com/blog/async-await', // Required for HTML format to handle relative URLs
-    googleApiKey: 'your-google-api-key'
+    sourceUrl: "https://example.com/blog/async-await", // Required for HTML format to handle relative URLs
+    googleApiKey: "your-google-gemini-api-key",
   });
 
-  console.log('Extracted Data:', result.data);
-  console.log('Token Usage:', result.usage);
+  console.log("Extracted Data:", result.data);
+  console.log("Token Usage:", result.usage);
 }
 
 main().catch(console.error);
 ```
 
-### Extracting from Markdown
+### Extracting from Markdown or Plain Text
 
 You can also extract structured data directly from a Markdown string:
 
 ```typescript
 const result = await extract({
   content: markdownContent,
+  // Specify that the content is Markdown. Besides HTML and Markdown, plain text is also supported via ContentFormat.TXT
   format: ContentFormat.MARKDOWN,
   schema: mySchema,
-  provider: LLMProvider.OPENAI,
-  openaiApiKey: 'your-openai-api-key'
+  googleApiKey: "your-google-gemini-api-key",
 });
 ```
 
@@ -131,31 +133,73 @@ You can provide a custom prompt to guide the extraction process:
 
 ```typescript
 const result = await extract({
-  content: textContent,
-  format: ContentFormat.TXT,
+  content: htmlContent,
+  format: ContentFormat.HTML,
   schema: mySchema,
-  prompt: "Extract only products that are on sale or have special discounts. Include their original prices, discounted prices, and all specifications.",
-  provider: LLMProvider.GOOGLE_GEMINI,
-  googleApiKey: 'your-google-api-key'
+  sourceUrl: "https://example.com/products",
+  // In the custom prompt, define what data should be retrieved
+  prompt: "Extract ONLY products that are on sale or have special discounts. Include their original prices, discounted prices, and product URL.",
+  googleApiKey: "your-google-gemini-api-key",
 });
 ```
 
 If no prompt is provided, a default extraction prompt will be used.
 
-### Customizing Model and Managing Token Limits
+### Data Enrichment
+
+You can use the `dataToEnrich` option to provide an existing data object that will be enriched with additional information from the content. This is particularly useful for:
+
+- Updating incomplete records with missing information
+- Enhancing existing data with new details from content
+- Merging data from multiple sources
+
+The LLM will be instructed to enrich the provided object rather than creating a completely new one:
+
+```typescript
+// Example of enriching a product record with missing information
+const productToEnrich = {
+  productUrl: "https://example.com/products/smart-security-camera",
+  name: "",
+  price: 0,
+  reviews: [],
+};
+
+const result = await extract({
+  content: htmlContent,
+  format: ContentFormat.HTML,
+  schema: productSchema,
+  sourceUrl: "https://example.com/products/smart-security-camera",
+  prompt: "Enrich the product data with complete details from the product page.",
+  dataToEnrich: productToEnrich,
+  googleApiKey: "your-google-gemini-api-key",
+});
+
+// Result will contain the original data enriched with information from the content
+console.log(result.data);
+// {
+//   productUrl: "https://example.com/products/smart-security-camera", // Preserved from original object
+//   name: "Smart Security Camera", // Enriched from the product page
+//   price: 74.50, // Enriched from the product page
+//   reviews: ["I really like this camera", ...] // Reviews enriched from the product page
+// }
+```
+
+### Customizing LLM Provider and Managing Token Limits
 
-You can customize the model and manage token limits to control costs and ensure your content fits within the model's maximum context window:
+You can customize the LLM provider and model, and manage token limits to control costs and ensure your content fits within the model's maximum context window:
 
 ```typescript
 // Extract from Markdown with token limit
 const result = await extract({
   content: markdownContent,
   format: ContentFormat.MARKDOWN,
   schema,
+  // Provide model provider and model name
   provider: LLMProvider.OPENAI,
-  openaiApiKey: 'your-openai-api-key',
-  modelName: 'gpt-4o',
-  maxInputTokens: 128000 // Limit to roughly 128K tokens (max input for gpt-4o-mini)
+  modelName: "gpt-4o-mini",
+  openaiApiKey: "your-openai-api-key",
+  // Limit to roughly 128K tokens (max input for gpt-4o-mini)
+  maxInputTokens: 128000,
 });
 ```
 
@@ -171,7 +215,7 @@ const result = await extract({
   htmlExtractionOptions: {
     extractMainHtml: true // Uses heuristics to remove navigation, headers, footers, etc.
   },
-  sourceUrl: sourceUrl
+  sourceUrl,
 });
 ```
 
@@ -234,13 +278,14 @@ Main function to extract structured data from content.
 | `schema` | `z.ZodTypeAny` | Zod schema defining the structure to extract | Required |
 | `prompt` | `string` | Custom prompt to guide the extraction process | Internal default prompt |
 | `provider` | `LLMProvider` | LLM provider (GOOGLE_GEMINI or OPENAI) | `LLMProvider.GOOGLE_GEMINI` |
-| `modelName` | `string` | Model name to use | Provider-specific default |
-| `googleApiKey` | `string` | Google API key (if using Google Gemini provider) | From env `GOOGLE_API_KEY` |
+| `modelName` | `string` | Model name to use | Provider-specific default (Google Gemini 2.5 flash or OpenAI GPT-4o mini) |
+| `googleApiKey` | `string` | Google Gemini API key (if using Google Gemini provider) | From env `GOOGLE_API_KEY` |
 | `openaiApiKey` | `string` | OpenAI API key (if using OpenAI provider) | From env `OPENAI_API_KEY` |
 | `temperature` | `number` | Temperature for the LLM (0-1) | `0` |
 | `htmlExtractionOptions` | `HTMLExtractionOptions` | HTML-specific options for content extraction (see below) | `{}` |
 | `sourceUrl` | `string` | URL of the HTML content, required when format is HTML to properly handle relative URLs | Required for HTML format |
 | `maxInputTokens` | `number` | Maximum number of input tokens to send to the LLM. Uses a rough conversion of 4 characters per token. When specified, content will be truncated if the total prompt size exceeds this limit. | `undefined` |
+| `dataToEnrich` | `Record<string, any>` | Original data object to enrich with information from the content. When provided, the LLM will be instructed to update this object rather than creating a new one from scratch. | `undefined` |
 
 #### HTML Extraction Options
 
@@ -288,10 +333,10 @@ The function returns a string containing the markdown conversion of the HTML con
 #### Example
 
 ```typescript
-import { convertHtmlToMarkdown, HTMLExtractionOptions } from 'lightfeed-extract';
+import { convertHtmlToMarkdown, HTMLExtractionOptions } from "lightfeed-extract";
 
 // Basic conversion
-const markdown = convertHtmlToMarkdown('<h1>Hello World</h1><p>This is a test</p>');
+const markdown = convertHtmlToMarkdown("<h1>Hello World</h1><p>This is a test</p>");
 console.log(markdown);
 // Output: "Hello World\n===========\n\nThis is a test"
 
@@ -303,9 +348,9 @@ const options: HTMLExtractionOptions = {
 
 // With source URL to handle relative links
 const markdownWithOptions = convertHtmlToMarkdown(
-  '<div><img src="/images/logo.png" alt="Logo"><a href="/about">About</a></div>',
+  "<div><img src=\"/images/logo.png\" alt=\"Logo\"><a href=\"/about\">About</a></div>",
   options,
-  'https://example.com'
+  "https://example.com"
 );
 console.log(markdownWithOptions);
 // Output: "![Logo](https://example.com/images/logo.png)[About](https://example.com/about)"
@@ -321,8 +366,8 @@ safeSanitizedParser<T>(schema: ZodTypeAny, rawObject: unknown): z.infer<T> | nul
 ```
 
 ```typescript
-import { safeSanitizedParser } from 'lightfeed-extract';
-import { z } from 'zod';
+import { safeSanitizedParser } from "lightfeed-extract";
+import { z } from "zod";
 
 // Define a product catalog schema
 const productSchema = z.object({
@@ -360,7 +405,7 @@ const rawLLMOutput = {
     },
     {
       id: 3,
-      // Missing required 'name' field
+      // Missing required "name" field
       price: 45.99,
       inStock: false
     },
diff --git a/src/extractors.ts b/src/extractors.ts
@@ -72,6 +72,7 @@ interface ExtractionPromptOptions {
   format: string;
   content: string;
   customPrompt?: string;
+  dataToEnrich?: Record<string, any>;
 }
 
 interface TruncateContentOptions extends ExtractionPromptOptions {
@@ -85,20 +86,44 @@ export function generateExtractionPrompt({
   format,
   content,
   customPrompt,
+  dataToEnrich,
 }: ExtractionPromptOptions): string {
   // Base prompt structure that's shared between default and custom prompts
   const extractionTask = customPrompt
     ? `${customPrompt}`
     : "Please extract structured information from the provided context.";
 
-  return `Context information is below:
+  // If dataToEnrich is provided, include it in the prompt for enrichment
+  let promptTemplate = `Context information is below:
 ------
 Format: ${format}
 ---
 ${content}
 ------
 
-You are a data extraction assistant that extracts structured information from the above context.
+`;
+
+  if (dataToEnrich) {
+    promptTemplate += `Format: JSON
+---
+${JSON.stringify(dataToEnrich, null, 2)}
+------
+
+You are a data extraction assistant that extracts structured information from the above context in ${format} and JSON.
+
+Your task is: ${extractionTask}
+
+## Guidelines:
+1. Extract ONLY information explicitly stated in the context
+2. Enrich the original JSON object with information from the context
+3. Fill additional fields based on relevant information in the context
+4. Do not make assumptions or infer missing data
+5. Leave fields empty when information is not present or you are uncertain
+6. Do not include information that appears incomplete or truncated
+
+`;
+  } else {
+    promptTemplate += `You are a data extraction assistant that extracts structured information from the above context.
 
 Your task is: ${extractionTask}
 
@@ -109,9 +134,12 @@ Your task is: ${extractionTask}
 4. Do not include information that appears incomplete or truncated
 5. Follow the required schema exactly
 
-Return only the structured data in valid JSON format and nothing else.
-
 `;
+  }
+
+  promptTemplate += `Return only the structured data in valid JSON format and nothing else.`;
+
+  return promptTemplate;
 }
 
 /**
@@ -122,6 +150,7 @@ export function truncateContent({
   format,
   content,
   customPrompt,
+  dataToEnrich,
   maxTokens,
 }: TruncateContentOptions): string {
   const maxChars = maxTokens * 4;
@@ -131,6 +160,7 @@ export function truncateContent({
     format,
     content,
     customPrompt,
+    dataToEnrich,
   });
 
   // If the full prompt is within limits, return original content
@@ -157,7 +187,8 @@ export async function extractWithLLM<T extends z.ZodTypeAny>(
   temperature: number = 0,
   customPrompt?: string,
   format: string = ContentFormat.MARKDOWN,
-  maxInputTokens?: number
+  maxInputTokens?: number,
+  dataToEnrich?: Record<string, any>
 ): Promise<{ data: z.infer<T>; usage: Usage }> {
   const llm = createLLM(provider, modelName, apiKey, temperature);
   let usage: Usage = {};
@@ -168,6 +199,7 @@ export async function extractWithLLM<T extends z.ZodTypeAny>(
         format,
         content,
         customPrompt,
+        dataToEnrich,
         maxTokens: maxInputTokens,
       })
     : content;
@@ -177,6 +209,7 @@ export async function extractWithLLM<T extends z.ZodTypeAny>(
     format,
     content: truncatedContent,
     customPrompt,
+    dataToEnrich,
   });
 
   try {
diff --git a/src/index.ts b/src/index.ts
@@ -19,6 +19,19 @@ const DEFAULT_MODELS = {
  * Extract structured data from HTML, markdown, or plain text content using an LLM
  *
  * @param options Configuration options for extraction
+ * @param options.content HTML, markdown, or plain text content to extract from
+ * @param options.format Content format (HTML, MARKDOWN, or TXT)
+ * @param options.schema Zod schema defining the structure to extract
+ * @param options.provider LLM provider (GOOGLE_GEMINI or OPENAI)
+ * @param options.modelName Model name to use (provider-specific)
+ * @param options.googleApiKey Google API key (if using Google Gemini provider)
+ * @param options.openaiApiKey OpenAI API key (if using OpenAI provider)
+ * @param options.temperature Temperature for the LLM (0-1)
+ * @param options.prompt Custom prompt to guide the extraction process
+ * @param options.sourceUrl URL of the HTML content (required for HTML format)
+ * @param options.htmlExtractionOptions HTML-specific options for content extraction
+ * @param options.maxInputTokens Maximum number of input tokens to send to the LLM
+ * @param options.dataToEnrich Original data object to enrich with information from the content
  * @returns The extracted data, original content, and usage statistics
  */
 export async function extract<T extends z.ZodTypeAny>(
@@ -80,7 +93,8 @@ export async function extract<T extends z.ZodTypeAny>(
     options.temperature ?? 0,
     options.prompt,
     formatToUse.toString(), // Pass the correct format based on actual content
-    options.maxInputTokens
+    options.maxInputTokens,
+    options.dataToEnrich
   );
 
   // Return the full result
diff --git a/src/types.ts b/src/types.ts
@@ -80,6 +80,9 @@ export interface ExtractorOptions<T extends z.ZodTypeAny> {
 
   /** Maximum number of input tokens to send to the LLM. Uses a rough conversion of 4 characters per token. */
   maxInputTokens?: number;
+
+  /** Original data object to enrich with extracted information. When provided, the LLM will be instructed to enrich this object with additional information from the content. */
+  dataToEnrich?: Record<string, any>;
 }
 
 /**
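
To make the effect of `dataToEnrich` concrete, here is a minimal sketch of how the object ends up in the prompt as a second context block, simplified from the `generateExtractionPrompt` change in this commit. The function name `buildContextBlocks` and the `EnrichInput` type are hypothetical and exist only for illustration; the guidelines section and the final "return JSON" instruction are omitted.

```typescript
// Hypothetical, simplified stand-in for the context-building part of
// generateExtractionPrompt in src/extractors.ts. Not the library's API.
type EnrichInput = Record<string, unknown>;

function buildContextBlocks(
  format: string,
  content: string,
  dataToEnrich?: EnrichInput
): string {
  // First block: the page content in its declared format.
  let prompt = `Context information is below:\n------\nFormat: ${format}\n---\n${content}\n------\n\n`;

  // Second block, only in enrich mode: the existing object, pretty-printed as JSON.
  if (dataToEnrich) {
    prompt += `Format: JSON\n---\n${JSON.stringify(dataToEnrich, null, 2)}\n------\n\n`;
  }
  return prompt;
}

// Usage: an incomplete product record is appended after the page content.
const enrichPrompt = buildContextBlocks("markdown", "# Smart Security Camera", {
  name: "",
  price: 0,
});
console.log(enrichPrompt);
```

Because the object is serialized into the prompt text, its size counts toward the input budget in the same way the page content does.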
diff --git a/tests/integration/extract.test.ts b/tests/integration/extract.test.ts
diff --git a/tests/unit/extractors.test.ts b/tests/unit/extractors.test.ts
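
The diff passes `dataToEnrich` into `truncateContent` but does not show the truncation body. The sketch below is one plausible way to apply the 4-characters-per-token rule while counting the enrichment JSON as prompt overhead. The names `truncateToBudget` and `buildPrompt` are assumptions for illustration, and the logic is not taken from the library.

```typescript
// Hypothetical sketch of budget-based truncation. The real truncateContent
// in src/extractors.ts may differ in its details.
const CHARS_PER_TOKEN = 4; // rough conversion stated in the README

function truncateToBudget(
  content: string,
  buildPrompt: (content: string) => string, // e.g. wraps content plus the enrich JSON
  maxTokens: number
): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  const fullPrompt = buildPrompt(content);

  // If the full prompt is within limits, return the original content.
  if (fullPrompt.length <= maxChars) return content;

  // Otherwise trim the content by the number of characters over budget.
  // This assumes the prompt grows one character per content character.
  const excess = fullPrompt.length - maxChars;
  return content.slice(0, Math.max(0, content.length - excess));
}

// Usage: 10 characters of fixed overhead, 30 of content, budget of 5 tokens (20 chars).
const wrap = (c: string) => "HEAD" + c + "{\"a\":1}".slice(0, 6);
const trimmed = truncateToBudget("a".repeat(30), wrap, 5);
console.log(trimmed.length, wrap(trimmed).length);
```

Only the content is trimmed in this sketch; the instructions and the object to enrich are kept whole, so a very large `dataToEnrich` leaves less room for page content.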
