
Snowflake RAG System with Type-Specific Document Processing

Overview

A production-ready Retrieval-Augmented Generation (RAG) system built on Snowflake that intelligently processes different document types using specialized chunking strategies. The system uses Snowflake Cortex for embeddings and LLM completion, with a focus on semantic accuracy and extensibility.

Key Features

  • Multi-Strategy Document Processing: Different chunking approaches for different document types
  • Section-Based Resume Chunking: Preserves semantic context by chunking resumes by logical sections
  • Vector Similarity Search: Uses Snowflake Cortex embeddings (768-dimensional Arctic embeddings)
  • Type-Extensible Architecture: Easy to add new document types with custom processing strategies
  • Automated Batch Processing: Cursor-based iteration handles multiple documents efficiently
  • Query Analytics: Built-in logging and performance tracking

Architecture

Database Schema

Core Tables

DOCUMENT_STAGING

  • Temporary storage for uploaded documents before processing
  • Tracks processing status and document type
  • Fields: staging_id, doc_name, doc_type, content, document_type_id, uploaded_at, processed

DOCUMENTS

  • Final storage for processed document chunks with embeddings
  • Fields: doc_id, doc_name, doc_type, content, chunk_text, chunk_index, total_chunks, embedding (VECTOR), metadata (VARIANT), created_at, updated_at
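
A minimal DDL sketch of this table, assuming the field list above maps one-to-one onto column names and that the embedding column uses Snowflake's VECTOR(FLOAT, 768) type (the authoritative definition is the schema file in /tables):

CREATE TABLE IF NOT EXISTS RAG_SCHEMA.DOCUMENTS (
    doc_id       NUMBER IDENTITY PRIMARY KEY,
    doc_name     VARCHAR,
    doc_type     VARCHAR,
    content      VARCHAR,
    chunk_text   VARCHAR,
    chunk_index  NUMBER,
    total_chunks NUMBER,
    embedding    VECTOR(FLOAT, 768),   -- Arctic embeddings are 768-dimensional
    metadata     VARIANT,              -- e.g., section_name, chunking_method
    created_at   TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
    updated_at   TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);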

RAG_DOCUMENT_TYPE

  • Registry of document types and their processing configurations
  • Fields: typeid, type_name, type_description, chunking_function, chunk_size, chunk_overlap, is_active, created_at

QUERY_LOGS

  • Audit trail of all RAG queries with performance metrics
  • Fields: query_id, user_query, retrieved_chunks, response_text, execution_time_ms, model_used, created_at

Document Types

| Type | Chunking Strategy | Function | Purpose |
|---|---|---|---|
| Resume | Section-based | CHUNK_RESUME_SIMPLE | Preserves semantic context of resume sections (Education, Experience, Skills, etc.) |
| General | Size-based | CHUNK_TEXT_PYTHON | Standard overlapping chunks for general documents |
| Job Description | TBD | CHUNK_TEXT_PYTHON (placeholder) | Ready for custom implementation |

Key Components

Chunking Functions

CHUNK_RESUME_SIMPLE(content VARCHAR)

SQL-based UDTF that intelligently chunks resumes by detecting section headers:

  • Professional Summary
  • Technical Skills
  • Experience
  • Professional Development
  • Education

Returns: chunk_text, chunk_index, section_name

Key Features:

  • Preserves complete sections as semantic units
  • Filters out short/empty sections
  • Maintains section metadata for context
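
The repository's function matches the specific headers listed above. As a rough illustration of the same idea, the sketch below (a hypothetical, simplified variant, not the actual implementation) splits on blank lines and treats the first line of each block as its section name:

CREATE OR REPLACE FUNCTION RAG_SCHEMA.CHUNK_RESUME_SIMPLE_SKETCH(content VARCHAR)
RETURNS TABLE (chunk_text VARCHAR, chunk_index NUMBER, section_name VARCHAR)
LANGUAGE SQL
AS $$
    SELECT
        s.value::VARCHAR AS chunk_text,
        ROW_NUMBER() OVER (ORDER BY s.index) AS chunk_index,
        -- first line of the block, normalized to UPPER_SNAKE_CASE, as the section label
        UPPER(REPLACE(TRIM(SPLIT_PART(s.value, '\n', 1)), ' ', '_')) AS section_name
    FROM TABLE(SPLIT_TO_TABLE(content, '\n\n')) s
    WHERE LENGTH(TRIM(s.value)) > 50   -- filter out short/empty sections
$$;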

CHUNK_TEXT_PYTHON(content VARCHAR, chunk_size INT, chunk_overlap INT)

Python UDTF for standard text chunking:

  • Configurable chunk size and overlap
  • Attempts to break at natural boundaries (sentences, paragraphs)
  • Safety limits (max 1000 chunks per document)

Returns: chunk_text, chunk_index
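
A compact sketch of what a size-based Python UDTF of this shape can look like (assumed structure, not the repository's exact code; it steps through the text at a fixed stride and omits the natural-boundary logic described above):

CREATE OR REPLACE FUNCTION RAG_SCHEMA.CHUNK_TEXT_PYTHON_SKETCH(content VARCHAR, chunk_size INT, chunk_overlap INT)
RETURNS TABLE (chunk_text VARCHAR, chunk_index NUMBER)
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
HANDLER = 'Chunker'
AS $$
class Chunker:
    def process(self, content, chunk_size, chunk_overlap):
        if not content:
            return
        step = max(chunk_size - chunk_overlap, 1)
        index, start = 0, 0
        while start < len(content) and index < 1000:  # safety limit: max 1000 chunks
            yield (content[start:start + chunk_size], index)
            index += 1
            start += step
$$;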

Core Procedures

PROCESS_STAGED_DOCUMENTS_TYPED()

Automated batch processor for multiple document types.

Algorithm:

  1. Query unprocessed documents with their type configuration
  2. Use cursor iteration to process each document individually
  3. Route to appropriate chunking function based on type_name
  4. Generate vector embeddings using Snowflake Cortex
  5. Store chunks with metadata
  6. Mark document as processed

Critical Implementation Detail: Uses variable binding (:content_var) rather than column references to avoid Snowflake UDTF correlation limitations.

Returns: Processing summary with counts by document type
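
The routing core can be pictured as an anonymous Snowflake Scripting block like the sketch below; it handles only the Resume branch, uses a reduced column list, and omits the error handling and other branches of the real procedure, so treat the names and details as assumptions:

DECLARE
    resume_count   INTEGER DEFAULT 0;
    staging_id_var INTEGER;
    doc_name_var   VARCHAR;
    content_var    VARCHAR;
    type_name_var  VARCHAR;
    staged CURSOR FOR
        SELECT s.staging_id, s.doc_name, s.content, t.type_name
        FROM RAG_SCHEMA.DOCUMENT_STAGING s
        JOIN RAG_SCHEMA.RAG_DOCUMENT_TYPE t ON t.typeid = s.document_type_id
        WHERE s.processed = FALSE;
BEGIN
    FOR doc IN staged DO
        -- copy cursor fields into declared variables so they can be bound in SQL
        staging_id_var := doc.staging_id;
        doc_name_var   := doc.doc_name;
        content_var    := doc.content;
        type_name_var  := doc.type_name;
        IF (type_name_var = 'Resume') THEN
            INSERT INTO RAG_SCHEMA.DOCUMENTS (doc_name, chunk_text, chunk_index, embedding, metadata)
            SELECT :doc_name_var, chunk_text, chunk_index,
                   SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk_text),
                   OBJECT_CONSTRUCT('section_name', section_name)
            FROM TABLE(RAG_SCHEMA.CHUNK_RESUME_SIMPLE(:content_var));   -- bind the variable, never the column
            resume_count := resume_count + 1;
        END IF;
        UPDATE RAG_SCHEMA.DOCUMENT_STAGING SET processed = TRUE WHERE staging_id = :staging_id_var;
    END FOR;
    RETURN 'Processed ' || resume_count || ' resumes';
END;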

ASK_QUESTION(user_question VARCHAR, top_k INT, model_name VARCHAR)

RAG query processor.

Algorithm:

  1. Generate query embedding using Cortex
  2. Vector similarity search to find top-k relevant chunks (threshold: 0.3)
  3. Aggregate chunks into context with source attribution
  4. Submit to Cortex LLM with context-aware prompt
  5. Return answer with source citations

Returns: answer, sources (array), query
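
A sketch of the retrieval-and-completion core using the 0.3 threshold and Arctic embedding model described here (the stored procedure additionally assembles the sources array, parameterizes top_k and the model, and formats the prompt):

WITH question AS (
    SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m',
           'What is the candidate''s education background?') AS q_vec
),
top_chunks AS (
    SELECT d.doc_name, d.chunk_index, d.chunk_text,
           VECTOR_COSINE_SIMILARITY(d.embedding, q.q_vec) AS similarity
    FROM RAG_SCHEMA.DOCUMENTS d, question q
    WHERE VECTOR_COSINE_SIMILARITY(d.embedding, q.q_vec) >= 0.3   -- minimum similarity threshold
    ORDER BY similarity DESC
    LIMIT 5                                                       -- top_k
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
           'llama3-8b',
           'Answer the question using only this context:\n'
           || LISTAGG(chunk_text, '\n---\n')
           || '\n\nQuestion: What is the candidate''s education background?'
       ) AS answer
FROM top_chunks;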

EXECUTE_RAG(user_question VARCHAR, top_k INT, model_name VARCHAR)

Production wrapper around ASK_QUESTION with logging.

Features:

  • Performance timing
  • Query logging to QUERY_LOGS table
  • Error handling

Returns: answer, sources

Analytics Views

DOCUMENT_STATS

  • Summary statistics per document
  • Chunk counts, average chunk length
  • First indexed and last updated timestamps

QUERY_ANALYTICS

  • Daily query aggregations
  • Average response times
  • Chunks retrieved per query
  • Breakdown by model used
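
A sketch of what this daily-aggregation view can look like, assuming retrieved_chunks is stored as a numeric count and using the QUERY_LOGS columns listed above:

CREATE OR REPLACE VIEW RAG_SCHEMA.QUERY_ANALYTICS AS
SELECT
    DATE(created_at)       AS query_date,
    model_used,
    COUNT(*)               AS query_count,
    AVG(execution_time_ms) AS avg_response_ms,
    AVG(retrieved_chunks)  AS avg_chunks_retrieved
FROM RAG_SCHEMA.QUERY_LOGS
GROUP BY DATE(created_at), model_used;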

Technical Challenges Solved

Snowflake UDTF Correlation Limitation

Problem: Snowflake does not support correlated table functions. This pattern fails:

SELECT chunk_text
FROM document_staging s,
     TABLE(CHUNK_FUNCTION(s.content)) chunks  -- ❌ FAILS
WHERE s.doc_name = 'example';

Error: "Unsupported subquery type cannot be evaluated"

Solution: Use cursor-based iteration, copying the column into a declared variable and binding that variable:

FOR doc IN cursor DO
    content_var := doc.content;               -- copy the field into a declared variable
    INSERT INTO documents
    SELECT ...
    FROM TABLE(CHUNK_FUNCTION(:content_var))  -- ✅ WORKS
END FOR;

Single Document Processing Pattern

For processing a single known document, use a scalar subquery:

INSERT INTO documents (...)
WITH chunked_data AS (
    SELECT 
        chunks.*,
        COUNT(*) OVER () as total_chunks
    FROM TABLE(CHUNK_RESUME_SIMPLE(
        (SELECT content FROM document_staging WHERE doc_name = 'Target Doc')
    )) chunks
    CROSS JOIN (SELECT doc_name, doc_type, content 
                FROM document_staging 
                WHERE doc_name = 'Target Doc') s
)
SELECT ... FROM chunked_data;

Testing

Note: The test scenarios below use the author's actual resume data. When testing with your own documents, you'll get different results based on your content, but the structure and process will be the same.

Test Scenarios Executed

1. Resume Section-Based Chunking

Objective: Verify education information is preserved as a complete semantic unit

Test Data: Resume for "Stephen Stephan" containing:

  • Jones College coursework (Computer Science, Business Administration)
  • Florida State College at Jacksonville coursework (Computer Programming, Information Systems)
  • Terry Parker High School diploma

Test Query: "What is Stephen's education background?"

Test Execution:

CALL RAG_SCHEMA.ASK_QUESTION('What is Stephen''s education background?', 5, 'llama3-8b');

Expected Result: Complete education history returned in context

Actual Result: ✅ PASS

Answer: According to the provided context, Stephen Stephan's education background is as follows:
* Jones College, Jacksonville, FL: Coursework in Computer Science and Business Administration
* Florida State College at Jacksonville, Jacksonville, FL: Coursework in Computer Programming and Information Systems  
* Terry Parker High School, Jacksonville, FL: High School Diploma

Sources:
- Chunk 3: similarity 0.686
- Chunk 2: similarity 0.642
- Chunk 5: similarity 0.583
- Chunk 6: similarity 0.419 (Education section)
- Chunk 4: similarity 0.395

Sources:

[
  {
    "chunk": 3,
    "document": "Stephen Stephan Resume", 
    "similarity": 0.686
  },
  {
    "chunk": 2,
    "document": "Stephen Stephan Resume",
    "similarity": 0.642
  },
  {
    "chunk": 5, 
    "document": "Stephen Stephan Resume",
    "similarity": 0.583
  },
  {
    "chunk": 6,
    "document": "Stephen Stephan Resume",
    "similarity": 0.419
  },
  {
    "chunk": 4,
    "document": "Stephen Stephan Resume",
    "similarity": 0.395
  }
]

Validation:

  • All three educational institutions retrieved
  • Complete coursework details included
  • High similarity scores (0.419-0.686)
  • Education section (chunk 6) included in top 5 results

2. Automated Multi-Document Processing

Objective: Verify automated processor handles type-specific chunking

Test Setup:

  1. Reset document processing status:

    UPDATE DOCUMENT_STAGING SET processed = FALSE WHERE doc_name = 'Stephen Stephan Resume';
    DELETE FROM DOCUMENTS WHERE doc_name = 'Stephen Stephan Resume';
  2. Execute automated processor:

    CALL PROCESS_STAGED_DOCUMENTS_TYPED();

Expected Result: Resume processed with section-based chunking, marked as processed

Actual Result: ✅ PASS

Return: "Processed 1 resumes and 0 general documents"

Verification:
- 5 chunks created (indices 2-6)
- Sections: EXPERIENCE, TECHNICAL_SKILLS, PROFESSIONAL_DEVELOPMENT, EDUCATION
- Document marked processed = TRUE
- document_type_id = 2 (Resume)

3. Chunk Structure Validation

Objective: Verify chunk metadata and structure

Query:

SELECT 
    chunk_index,
    metadata:section_name::STRING as section,
    total_chunks,
    LENGTH(chunk_text) as chunk_length,
    SUBSTRING(chunk_text, 1, 100) as preview
FROM DOCUMENTS
WHERE doc_name = 'Stephen Stephan Resume'
ORDER BY chunk_index;

Results: ✅ PASS

  • 5 chunks with consistent total_chunks = 5
  • Section names properly stored in metadata
  • Chunk lengths: 271-2247 characters
  • All sections represented: EXPERIENCE (2x), TECHNICAL_SKILLS, PROFESSIONAL_DEVELOPMENT, EDUCATION

4. End-to-End RAG Pipeline

Objective: Validate complete pipeline from staging to query response

Test Flow:

  1. Document upload → DOCUMENT_STAGING
  2. Automated processing → Type detection → Section-based chunking
  3. Embedding generation → DOCUMENTS table with vectors
  4. Semantic query → Vector similarity search → LLM completion
  5. Query logging → QUERY_LOGS

Result: ✅ PASS - Complete pipeline functional with accurate results

Test Summary

| Test Case | Status | Notes |
|---|---|---|
| Section-based chunking | ✅ PASS | Education preserved as semantic unit |
| Automated processing | ✅ PASS | Type-specific routing works |
| Vector similarity search | ✅ PASS | High relevance scores (0.4-0.7) |
| Metadata preservation | ✅ PASS | Section names stored correctly |
| Query accuracy | ✅ PASS | Complete education info retrieved |
| Processing status tracking | ✅ PASS | Documents marked processed |
| End-to-end pipeline | ✅ PASS | All components integrated |

Setup Instructions

1. Create Database Structure

-- Create schema
CREATE SCHEMA IF NOT EXISTS RAG_SCHEMA;

-- Create tables (see schema files in /tables directory)
-- Execute in order: RAG_DOCUMENT_TYPE, DOCUMENT_STAGING, DOCUMENTS, QUERY_LOGS

-- Create initial document types
INSERT INTO RAG_SCHEMA.RAG_DOCUMENT_TYPE 
(TYPE_NAME, TYPE_DESCRIPTION, CHUNKING_FUNCTION, CHUNK_SIZE, CHUNK_OVERLAP) 
VALUES 
('General', 'All Purpose RAG', 'CHUNK_TEXT_PYTHON', 1000, 100),
('Resume', 'Complete Resume form', 'CHUNK_RESUME_SIMPLE', NULL, NULL),
('Job Description', 'Job descriptions', 'CHUNK_TEXT_PYTHON', 800, 50);

2. Create Functions

-- Create chunking functions (see /functions directory)
-- Execute: CHUNK_TEXT_PYTHON.sql, CHUNK_RESUME_SIMPLE.sql

3. Create Procedures

-- Create processing procedures (see /procedures directory)
-- Execute: PROCESS_STAGED_DOCUMENTS_TYPED.sql, ASK_QUESTION.sql, EXECUTE_RAG.sql

4. Create Views

-- Create analytics views (see /views directory)
-- Execute: DOCUMENT_STATS.sql, QUERY_ANALYTICS.sql

Usage

Upload and Process Documents

-- 1. Upload document to staging
INSERT INTO RAG_SCHEMA.DOCUMENT_STAGING (doc_name, doc_type, content, document_type_id)
VALUES (
    'John Doe Resume',
    'Resume',
    '... resume content ...',
    (SELECT typeid FROM RAG_SCHEMA.RAG_DOCUMENT_TYPE WHERE type_name = 'Resume')
);

-- 2. Process all staged documents
CALL RAG_SCHEMA.PROCESS_STAGED_DOCUMENTS_TYPED();

-- 3. Verify processing
SELECT * FROM RAG_SCHEMA.DOCUMENT_STATS WHERE doc_name = 'John Doe Resume';

Query the RAG System

-- Simple query
CALL RAG_SCHEMA.EXECUTE_RAG(
    'What is the candidate''s education background?',
    5,  -- top_k chunks to retrieve
    'llama3-8b'  -- Cortex model
);

-- Advanced query with custom parameters
CALL RAG_SCHEMA.ASK_QUESTION(
    'What programming languages does the candidate know?',
    10,
    'claude-3-sonnet'
);

Monitor System Performance

-- View document statistics
SELECT * FROM RAG_SCHEMA.DOCUMENT_STATS
ORDER BY last_updated DESC;

-- View query analytics
SELECT * FROM RAG_SCHEMA.QUERY_ANALYTICS
WHERE query_date >= CURRENT_DATE - 7
ORDER BY query_date DESC;

-- Recent queries
SELECT 
    user_query,
    retrieved_chunks,
    execution_time_ms,
    model_used,
    created_at
FROM RAG_SCHEMA.QUERY_LOGS
ORDER BY created_at DESC
LIMIT 10;

Adding New Document Types

1. Create Custom Chunking Function (Optional)

CREATE OR REPLACE FUNCTION RAG_SCHEMA.CHUNK_JOB_DESCRIPTION(content VARCHAR)
RETURNS TABLE (chunk_text VARCHAR, chunk_index NUMBER, section_name VARCHAR)
LANGUAGE SQL
AS $$
    -- Custom logic for job description chunking goes here,
    -- e.g., split by Requirements, Responsibilities, Qualifications sections.
    -- Placeholder body so the function compiles: return the whole document as one chunk.
    SELECT content AS chunk_text, 1 AS chunk_index, 'FULL_DOCUMENT' AS section_name
$$;

2. Register Document Type

INSERT INTO RAG_SCHEMA.RAG_DOCUMENT_TYPE 
(TYPE_NAME, TYPE_DESCRIPTION, CHUNKING_FUNCTION, CHUNK_SIZE, CHUNK_OVERLAP) 
VALUES (
    'Job Description',
    'Job postings and descriptions',
    'CHUNK_JOB_DESCRIPTION',
    800,
    50
);

3. Update Processing Procedure

Add new ELSIF branch in PROCESS_STAGED_DOCUMENTS_TYPED:

ELSIF (type_name_var = 'Job Description') THEN
    SELECT COUNT(*) INTO total_chunks_var
    FROM TABLE(RAG_SCHEMA.CHUNK_JOB_DESCRIPTION(:content_var));
    
    INSERT INTO documents (...)
    SELECT 
        :doc_name_var,
        :doc_type_var,
        :content_var,
        chunk_text,
        chunk_index,
        :total_chunks_var,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk_text),
        OBJECT_CONSTRUCT('chunking_method', 'job_description', 'section_name', section_name, 'document_type', 'Job Description')
    FROM TABLE(RAG_SCHEMA.CHUNK_JOB_DESCRIPTION(:content_var));
    
    job_desc_count := job_desc_count + 1;
END IF;

Performance Considerations

Embedding Generation

  • Uses Snowflake Cortex EMBED_TEXT_768 (Arctic embeddings)
  • Processes during document chunking (not at query time)
  • One-time cost per chunk
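
For illustration, each chunk's vector comes from a single call of this form at insert time, so queries only need to embed the question text:

SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768(
           'snowflake-arctic-embed-m',
           'Coursework in Computer Science and Business Administration'
       ) AS chunk_embedding;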

Vector Similarity Search

  • VECTOR_COSINE_SIMILARITY is highly optimized in Snowflake
  • Minimum similarity threshold: 0.3 (adjustable)
  • Top-k retrieval: Default 5 chunks (adjustable)

Batch Processing

  • Processes documents sequentially via cursor
  • Each document committed independently
  • Failed documents don't block others (with error handling)

Query Performance

  • Average query time: Varies by model (typically 1-5 seconds)
  • Cached embeddings eliminate re-computation
  • LLM completion is primary latency factor

Limitations and Known Issues

Current Limitations

  1. No Multi-Document Correlation: Each document processed independently
  2. Fixed Embedding Model: Arctic-embed-m hardcoded (768 dimensions)
  3. Manual Document Upload: No automated file ingestion
  4. Limited Error Recovery: Failed chunks may need manual intervention
  5. No Incremental Updates: Document re-processing requires deletion and re-insertion

Snowflake-Specific Constraints

  1. UDTF Correlation: Cannot use table functions with correlated column references
  2. Nested Exception Handling: Limited support in SQL scripting
  3. Cursor Syntax: Fields of a FOR...IN record cannot be bound directly inside SQL statements; copy them into declared variables first

Future Enhancements

Planned Features

  • Job Description specialized chunking
  • Batch file upload via Snowflake stages
  • Hybrid search (keyword + semantic)
  • Multi-document synthesis queries
  • Document versioning and change tracking
  • Advanced error handling and retry logic
  • Configurable embedding models
  • Real-time processing via Snowpipe

Research Areas

  • Cross-document relationship detection
  • Automatic document type classification
  • Query result caching
  • Fine-tuned embeddings for domain-specific content

Troubleshooting

Common Issues

Issue: "Unsupported subquery type cannot be evaluated"

  • Cause: Attempting to pass column reference to table function
  • Solution: Use cursor iteration with variable binding (:variable_name)

Issue: Empty or missing chunks

  • Cause: Section detection regex not matching document format
  • Solution: Review CHUNK_RESUME_SIMPLE logic; adjust section patterns
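
One quick check (a sketch using the single-document scalar-subquery pattern shown earlier, with a hypothetical document name) is to run the chunker directly against the staged content and inspect which sections it detects:

SELECT chunk_index, section_name, LEFT(chunk_text, 80) AS preview
FROM TABLE(RAG_SCHEMA.CHUNK_RESUME_SIMPLE(
    (SELECT content FROM RAG_SCHEMA.DOCUMENT_STAGING WHERE doc_name = 'John Doe Resume')
));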

Issue: Poor query results

  • Cause: Chunks too large/small or poor semantic boundaries
  • Solution: Adjust chunk_size/overlap or use specialized chunking function

Issue: Low similarity scores

  • Cause: Vocabulary mismatch between query and documents
  • Solution: Try query reformulation or adjust similarity threshold
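
Before changing the threshold, it can help to inspect the raw scores a query produces (a sketch; substitute the failing question text):

SELECT doc_name, chunk_index,
       VECTOR_COSINE_SIMILARITY(
           embedding,
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'your question here')
       ) AS similarity
FROM RAG_SCHEMA.DOCUMENTS
ORDER BY similarity DESC
LIMIT 10;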

Contributing

When adding new document types or chunking strategies:

  1. Create dedicated function following naming convention: CHUNK_[TYPE]
  2. Add comprehensive comments explaining chunking logic
  3. Test with sample documents before production use
  4. Update this README with new document type details
  5. Add test cases validating chunking accuracy

License

This code is provided for review and educational purposes only.

Usage Restrictions:

  • This software may NOT be used in any production environment without express written permission from the author
  • Permission must be explicitly sought and granted before any production deployment
  • Review, study, and educational use is permitted
  • Modifications and derivative works require author permission for production use

For permission requests, please contact the author.

Authors

Stephen Stephan

Built during a session focused on solving semantic chunking challenges for resume documents in Snowflake RAG systems.

Acknowledgments

  • Snowflake Cortex for embeddings and LLM completion
  • Arctic embeddings model for semantic understanding
  • Community feedback on UDTF limitations and workarounds
