
Snowflake RAG System with Type-Specific Document Processing

Overview

A production-ready Retrieval-Augmented Generation (RAG) system built on Snowflake that intelligently processes different document types using specialized chunking strategies. The system uses Snowflake Cortex for embeddings and LLM completion, with a focus on semantic accuracy and extensibility.

Key Features

  • Multi-Strategy Document Processing: Different chunking approaches for different document types
  • Section-Based Resume Chunking: Preserves semantic context by chunking resumes by logical sections
  • Vector Similarity Search: Uses Snowflake Cortex embeddings (768-dimensional Arctic embeddings)
  • Type-Extensible Architecture: Easy to add new document types with custom processing strategies
  • Automated Batch Processing: Cursor-based iteration handles multiple documents efficiently
  • Query Analytics: Built-in logging and performance tracking

Architecture

Database Schema

Core Tables

DOCUMENT_STAGING

  • Temporary storage for uploaded documents before processing
  • Tracks processing status and document type
  • Fields: staging_id, doc_name, doc_type, content, document_type_id, uploaded_at, processed

DOCUMENTS

  • Final storage for processed document chunks with embeddings
  • Fields: doc_id, doc_name, doc_type, content, chunk_text, chunk_index, total_chunks, embedding (VECTOR), metadata (VARIANT), created_at, updated_at
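
A minimal DDL sketch of this table, assuming the field list above maps one-to-one onto column names and that the embedding column uses Snowflake's VECTOR(FLOAT, 768) type (the authoritative definition is the schema file in /tables):

CREATE TABLE IF NOT EXISTS RAG_SCHEMA.DOCUMENTS (
    doc_id       NUMBER IDENTITY PRIMARY KEY,
    doc_name     VARCHAR,
    doc_type     VARCHAR,
    content      VARCHAR,
    chunk_text   VARCHAR,
    chunk_index  NUMBER,
    total_chunks NUMBER,
    embedding    VECTOR(FLOAT, 768),   -- Arctic embeddings are 768-dimensional
    metadata     VARIANT,              -- e.g., section_name, chunking_method
    created_at   TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
    updated_at   TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);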

RAG_DOCUMENT_TYPE

  • Registry of document types and their processing configurations
  • Fields: typeid, type_name, type_description, chunking_function, chunk_size, chunk_overlap, is_active, created_at

QUERY_LOGS

  • Audit trail of all RAG queries with performance metrics
  • Fields: query_id, user_query, retrieved_chunks, response_text, execution_time_ms, model_used, created_at

Document Types

| Type | Chunking Strategy | Function | Purpose |
|---|---|---|---|
| Resume | Section-based | CHUNK_RESUME_SIMPLE | Preserves semantic context of resume sections (Education, Experience, Skills, etc.) |
| General | Size-based | CHUNK_TEXT_PYTHON | Standard overlapping chunks for general documents |
| Job Description | TBD | CHUNK_TEXT_PYTHON (placeholder) | Ready for custom implementation |

Key Components

Chunking Functions

CHUNK_RESUME_SIMPLE(content VARCHAR)

SQL-based UDTF that intelligently chunks resumes by detecting section headers:

  • Professional Summary
  • Technical Skills
  • Experience
  • Professional Development
  • Education

Returns: chunk_text, chunk_index, section_name

Key Features:

  • Preserves complete sections as semantic units
  • Filters out short/empty sections
  • Maintains section metadata for context
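
The repository's function matches the specific headers listed above. As a rough illustration of the same idea, the sketch below (a hypothetical, simplified variant, not the actual implementation) splits on blank lines and treats the first line of each block as its section name:

CREATE OR REPLACE FUNCTION RAG_SCHEMA.CHUNK_RESUME_SIMPLE_SKETCH(content VARCHAR)
RETURNS TABLE (chunk_text VARCHAR, chunk_index NUMBER, section_name VARCHAR)
LANGUAGE SQL
AS $$
    SELECT
        s.value::VARCHAR AS chunk_text,
        ROW_NUMBER() OVER (ORDER BY s.index) AS chunk_index,
        -- first line of the block, normalized to UPPER_SNAKE_CASE, as the section label
        UPPER(REPLACE(TRIM(SPLIT_PART(s.value, '\n', 1)), ' ', '_')) AS section_name
    FROM TABLE(SPLIT_TO_TABLE(content, '\n\n')) s
    WHERE LENGTH(TRIM(s.value)) > 50   -- filter out short/empty sections
$$;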

CHUNK_TEXT_PYTHON(content VARCHAR, chunk_size INT, chunk_overlap INT)

Python UDTF for standard text chunking:

  • Configurable chunk size and overlap
  • Attempts to break at natural boundaries (sentences, paragraphs)
  • Safety limits (max 1000 chunks per document)

Returns: chunk_text, chunk_index
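
A compact sketch of what a size-based Python UDTF of this shape can look like (assumed structure, not the repository's exact code; it steps through the text at a fixed stride and omits the natural-boundary logic described above):

CREATE OR REPLACE FUNCTION RAG_SCHEMA.CHUNK_TEXT_PYTHON_SKETCH(content VARCHAR, chunk_size INT, chunk_overlap INT)
RETURNS TABLE (chunk_text VARCHAR, chunk_index NUMBER)
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
HANDLER = 'Chunker'
AS $$
class Chunker:
    def process(self, content, chunk_size, chunk_overlap):
        if not content:
            return
        step = max(chunk_size - chunk_overlap, 1)
        index, start = 0, 0
        while start < len(content) and index < 1000:  # safety limit: max 1000 chunks
            yield (content[start:start + chunk_size], index)
            index += 1
            start += step
$$;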

Core Procedures

PROCESS_STAGED_DOCUMENTS_TYPED()

Automated batch processor for multiple document types.

Algorithm:

  1. Query unprocessed documents with their type configuration
  2. Use cursor iteration to process each document individually
  3. Route to appropriate chunking function based on type_name
  4. Generate vector embeddings using Snowflake Cortex
  5. Store chunks with metadata
  6. Mark document as processed

Critical Implementation Detail: Uses variable binding (:content_var) rather than column references to avoid Snowflake UDTF correlation limitations.

Returns: Processing summary with counts by document type
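
The routing core can be pictured as an anonymous Snowflake Scripting block like the sketch below; it handles only the Resume branch, uses a reduced column list, and omits the error handling and other branches of the real procedure, so treat the names and details as assumptions:

DECLARE
    resume_count   INTEGER DEFAULT 0;
    staging_id_var INTEGER;
    doc_name_var   VARCHAR;
    content_var    VARCHAR;
    type_name_var  VARCHAR;
    staged CURSOR FOR
        SELECT s.staging_id, s.doc_name, s.content, t.type_name
        FROM RAG_SCHEMA.DOCUMENT_STAGING s
        JOIN RAG_SCHEMA.RAG_DOCUMENT_TYPE t ON t.typeid = s.document_type_id
        WHERE s.processed = FALSE;
BEGIN
    FOR doc IN staged DO
        -- copy cursor fields into declared variables so they can be bound in SQL
        staging_id_var := doc.staging_id;
        doc_name_var   := doc.doc_name;
        content_var    := doc.content;
        type_name_var  := doc.type_name;
        IF (type_name_var = 'Resume') THEN
            INSERT INTO RAG_SCHEMA.DOCUMENTS (doc_name, chunk_text, chunk_index, embedding, metadata)
            SELECT :doc_name_var, chunk_text, chunk_index,
                   SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk_text),
                   OBJECT_CONSTRUCT('section_name', section_name)
            FROM TABLE(RAG_SCHEMA.CHUNK_RESUME_SIMPLE(:content_var));   -- bind the variable, never the column
            resume_count := resume_count + 1;
        END IF;
        UPDATE RAG_SCHEMA.DOCUMENT_STAGING SET processed = TRUE WHERE staging_id = :staging_id_var;
    END FOR;
    RETURN 'Processed ' || resume_count || ' resumes';
END;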

ASK_QUESTION(user_question VARCHAR, top_k INT, model_name VARCHAR)

RAG query processor.

Algorithm:

  1. Generate query embedding using Cortex
  2. Vector similarity search to find top-k relevant chunks (threshold: 0.3)
  3. Aggregate chunks into context with source attribution
  4. Submit to Cortex LLM with context-aware prompt
  5. Return answer with source citations

Returns: answer, sources (array), query
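
A sketch of the retrieval-and-completion core using the 0.3 threshold and Arctic embedding model described here (the stored procedure additionally assembles the sources array, parameterizes top_k and the model, and formats the prompt):

WITH question AS (
    SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m',
           'What is the candidate''s education background?') AS q_vec
),
top_chunks AS (
    SELECT d.doc_name, d.chunk_index, d.chunk_text,
           VECTOR_COSINE_SIMILARITY(d.embedding, q.q_vec) AS similarity
    FROM RAG_SCHEMA.DOCUMENTS d, question q
    WHERE VECTOR_COSINE_SIMILARITY(d.embedding, q.q_vec) >= 0.3   -- minimum similarity threshold
    ORDER BY similarity DESC
    LIMIT 5                                                       -- top_k
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
           'llama3-8b',
           'Answer the question using only this context:\n'
           || LISTAGG(chunk_text, '\n---\n')
           || '\n\nQuestion: What is the candidate''s education background?'
       ) AS answer
FROM top_chunks;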

EXECUTE_RAG(user_question VARCHAR, top_k INT, model_name VARCHAR)

Production wrapper around ASK_QUESTION with logging.

Features:

  • Performance timing
  • Query logging to QUERY_LOGS table
  • Error handling

Returns: answer, sources

Analytics Views

DOCUMENT_STATS

  • Summary statistics per document
  • Chunk counts, average chunk length
  • First indexed and last updated timestamps

QUERY_ANALYTICS

  • Daily query aggregations
  • Average response times
  • Chunks retrieved per query
  • Breakdown by model used
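
A sketch of what this daily-aggregation view can look like, assuming retrieved_chunks is stored as a numeric count and using the QUERY_LOGS columns listed above:

CREATE OR REPLACE VIEW RAG_SCHEMA.QUERY_ANALYTICS AS
SELECT
    DATE(created_at)       AS query_date,
    model_used,
    COUNT(*)               AS query_count,
    AVG(execution_time_ms) AS avg_response_ms,
    AVG(retrieved_chunks)  AS avg_chunks_retrieved
FROM RAG_SCHEMA.QUERY_LOGS
GROUP BY DATE(created_at), model_used;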

Technical Challenges Solved

Snowflake UDTF Correlation Limitation

Problem: Snowflake does not support correlated table functions. This pattern fails:

SELECT chunk_text
FROM document_staging s,
     TABLE(CHUNK_FUNCTION(s.content)) chunks  -- ❌ FAILS
WHERE s.doc_name = 'example';

Error: "Unsupported subquery type cannot be evaluated"

Solution: Use cursor-based iteration, copying the column into a declared variable and binding that variable:

FOR doc IN cursor DO
    content_var := doc.content;               -- copy the field into a declared variable
    INSERT INTO documents
    SELECT ...
    FROM TABLE(CHUNK_FUNCTION(:content_var))  -- ✅ WORKS
END FOR;

Single Document Processing Pattern

For processing a single known document, use a scalar subquery:

INSERT INTO documents (...)
WITH chunked_data AS (
    SELECT 
        chunks.*,
        COUNT(*) OVER () as total_chunks
    FROM TABLE(CHUNK_RESUME_SIMPLE(
        (SELECT content FROM document_staging WHERE doc_name = 'Target Doc')
    )) chunks
    CROSS JOIN (SELECT doc_name, doc_type, content 
                FROM document_staging 
                WHERE doc_name = 'Target Doc') s
)
SELECT ... FROM chunked_data;

Testing

Note: The test scenarios below use the author's actual resume data. When testing with your own documents, you'll get different results based on your content, but the structure and process will be the same.

Test Scenarios Executed

1. Resume Section-Based Chunking

Objective: Verify education information is preserved as a complete semantic unit

Test Data: Resume for "Stephen Stephan" containing:

  • Jones College coursework (Computer Science, Business Administration)
  • Florida State College at Jacksonville coursework (Computer Programming, Information Systems)
  • Terry Parker High School diploma

Test Query: "What is Stephen's education background?"

Test Execution:

CALL RAG_SCHEMA.ASK_QUESTION('What is Stephen''s education background?', 5, 'llama3-8b');

Expected Result: Complete education history returned in context

Actual Result: ✅ PASS

Answer: According to the provided context, Stephen Stephan's education background is as follows:
* Jones College, Jacksonville, FL: Coursework in Computer Science and Business Administration
* Florida State College at Jacksonville, Jacksonville, FL: Coursework in Computer Programming and Information Systems  
* Terry Parker High School, Jacksonville, FL: High School Diploma

Sources:
- Chunk 3: similarity 0.686
- Chunk 2: similarity 0.642
- Chunk 5: similarity 0.583
- Chunk 6: similarity 0.419 (Education section)
- Chunk 4: similarity 0.395

Sources:

[
  {
    "chunk": 3,
    "document": "Stephen Stephan Resume", 
    "similarity": 0.686
  },
  {
    "chunk": 2,
    "document": "Stephen Stephan Resume",
    "similarity": 0.642
  },
  {
    "chunk": 5, 
    "document": "Stephen Stephan Resume",
    "similarity": 0.583
  },
  {
    "chunk": 6,
    "document": "Stephen Stephan Resume",
    "similarity": 0.419
  },
  {
    "chunk": 4,
    "document": "Stephen Stephan Resume",
    "similarity": 0.395
  }
]

Validation:

  • All three educational institutions retrieved
  • Complete coursework details included
  • High similarity scores (0.419-0.686)
  • Education section (chunk 6) included in top 5 results

2. Automated Multi-Document Processing

Objective: Verify automated processor handles type-specific chunking

Test Setup:

  1. Reset document processing status:

    UPDATE DOCUMENT_STAGING SET processed = FALSE WHERE doc_name = 'Stephen Stephan Resume';
    DELETE FROM DOCUMENTS WHERE doc_name = 'Stephen Stephan Resume';
  2. Execute automated processor:

    CALL PROCESS_STAGED_DOCUMENTS_TYPED();

Expected Result: Resume processed with section-based chunking, marked as processed

Actual Result: ✅ PASS

Return: "Processed 1 resumes and 0 general documents"

Verification:
- 5 chunks created (indices 2-6)
- Sections: EXPERIENCE, TECHNICAL_SKILLS, PROFESSIONAL_DEVELOPMENT, EDUCATION
- Document marked processed = TRUE
- document_type_id = 2 (Resume)

3. Chunk Structure Validation

Objective: Verify chunk metadata and structure

Query:

SELECT 
    chunk_index,
    metadata:section_name::STRING as section,
    total_chunks,
    LENGTH(chunk_text) as chunk_length,
    SUBSTRING(chunk_text, 1, 100) as preview
FROM DOCUMENTS
WHERE doc_name = 'Stephen Stephan Resume'
ORDER BY chunk_index;

Results: ✅ PASS

  • 5 chunks with consistent total_chunks = 5
  • Section names properly stored in metadata
  • Chunk lengths: 271-2247 characters
  • All sections represented: EXPERIENCE (2x), TECHNICAL_SKILLS, PROFESSIONAL_DEVELOPMENT, EDUCATION

4. End-to-End RAG Pipeline

Objective: Validate complete pipeline from staging to query response

Test Flow:

  1. Document upload → DOCUMENT_STAGING
  2. Automated processing → Type detection → Section-based chunking
  3. Embedding generation → DOCUMENTS table with vectors
  4. Semantic query → Vector similarity search → LLM completion
  5. Query logging → QUERY_LOGS

Result: ✅ PASS - Complete pipeline functional with accurate results

Test Summary

| Test Case | Status | Notes |
|---|---|---|
| Section-based chunking | ✅ PASS | Education preserved as semantic unit |
| Automated processing | ✅ PASS | Type-specific routing works |
| Vector similarity search | ✅ PASS | High relevance scores (0.4-0.7) |
| Metadata preservation | ✅ PASS | Section names stored correctly |
| Query accuracy | ✅ PASS | Complete education info retrieved |
| Processing status tracking | ✅ PASS | Documents marked processed |
| End-to-end pipeline | ✅ PASS | All components integrated |

Setup Instructions

1. Create Database Structure

-- Create schema
CREATE SCHEMA IF NOT EXISTS RAG_SCHEMA;

-- Create tables (see schema files in /tables directory)
-- Execute in order: RAG_DOCUMENT_TYPE, DOCUMENT_STAGING, DOCUMENTS, QUERY_LOGS

-- Create initial document types
INSERT INTO RAG_SCHEMA.RAG_DOCUMENT_TYPE 
(TYPE_NAME, TYPE_DESCRIPTION, CHUNKING_FUNCTION, CHUNK_SIZE, CHUNK_OVERLAP) 
VALUES 
('General', 'All Purpose RAG', 'CHUNK_TEXT_PYTHON', 1000, 100),
('Resume', 'Complete Resume form', 'CHUNK_RESUME_SIMPLE', NULL, NULL),
('Job Description', 'Job descriptions', 'CHUNK_TEXT_PYTHON', 800, 50);

2. Create Functions

-- Create chunking functions (see /functions directory)
-- Execute: CHUNK_TEXT_PYTHON.sql, CHUNK_RESUME_SIMPLE.sql

3. Create Procedures

-- Create processing procedures (see /procedures directory)
-- Execute: PROCESS_STAGED_DOCUMENTS_TYPED.sql, ASK_QUESTION.sql, EXECUTE_RAG.sql

4. Create Views

-- Create analytics views (see /views directory)
-- Execute: DOCUMENT_STATS.sql, QUERY_ANALYTICS.sql

Usage

Upload and Process Documents

-- 1. Upload document to staging
INSERT INTO RAG_SCHEMA.DOCUMENT_STAGING (doc_name, doc_type, content, document_type_id)
VALUES (
    'John Doe Resume',
    'Resume',
    '... resume content ...',
    (SELECT typeid FROM RAG_SCHEMA.RAG_DOCUMENT_TYPE WHERE type_name = 'Resume')
);

-- 2. Process all staged documents
CALL RAG_SCHEMA.PROCESS_STAGED_DOCUMENTS_TYPED();

-- 3. Verify processing
SELECT * FROM RAG_SCHEMA.DOCUMENT_STATS WHERE doc_name = 'John Doe Resume';

Query the RAG System

-- Simple query
CALL RAG_SCHEMA.EXECUTE_RAG(
    'What is the candidate''s education background?',
    5,  -- top_k chunks to retrieve
    'llama3-8b'  -- Cortex model
);

-- Advanced query with custom parameters
CALL RAG_SCHEMA.ASK_QUESTION(
    'What programming languages does the candidate know?',
    10,
    'claude-3-sonnet'
);

Monitor System Performance

-- View document statistics
SELECT * FROM RAG_SCHEMA.DOCUMENT_STATS
ORDER BY last_updated DESC;

-- View query analytics
SELECT * FROM RAG_SCHEMA.QUERY_ANALYTICS
WHERE query_date >= CURRENT_DATE - 7
ORDER BY query_date DESC;

-- Recent queries
SELECT 
    user_query,
    retrieved_chunks,
    execution_time_ms,
    model_used,
    created_at
FROM RAG_SCHEMA.QUERY_LOGS
ORDER BY created_at DESC
LIMIT 10;

Adding New Document Types

1. Create Custom Chunking Function (Optional)

CREATE OR REPLACE FUNCTION RAG_SCHEMA.CHUNK_JOB_DESCRIPTION(content VARCHAR)
RETURNS TABLE (chunk_text VARCHAR, chunk_index NUMBER, section_name VARCHAR)
LANGUAGE SQL
AS $$
    -- Custom logic for job description chunking goes here,
    -- e.g., split by Requirements, Responsibilities, Qualifications sections.
    -- Placeholder body so the function compiles: return the whole document as one chunk.
    SELECT content AS chunk_text, 1 AS chunk_index, 'FULL_DOCUMENT' AS section_name
$$;

2. Register Document Type

INSERT INTO RAG_SCHEMA.RAG_DOCUMENT_TYPE 
(TYPE_NAME, TYPE_DESCRIPTION, CHUNKING_FUNCTION, CHUNK_SIZE, CHUNK_OVERLAP) 
VALUES (
    'Job Description',
    'Job postings and descriptions',
    'CHUNK_JOB_DESCRIPTION',
    800,
    50
);

3. Update Processing Procedure

Add new ELSIF branch in PROCESS_STAGED_DOCUMENTS_TYPED:

ELSIF (type_name_var = 'Job Description') THEN
    SELECT COUNT(*) INTO total_chunks_var
    FROM TABLE(RAG_SCHEMA.CHUNK_JOB_DESCRIPTION(:content_var));
    
    INSERT INTO documents (...)
    SELECT 
        :doc_name_var,
        :doc_type_var,
        :content_var,
        chunk_text,
        chunk_index,
        :total_chunks_var,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk_text),
        OBJECT_CONSTRUCT('chunking_method', 'job_description', 'section_name', section_name, 'document_type', 'Job Description')
    FROM TABLE(RAG_SCHEMA.CHUNK_JOB_DESCRIPTION(:content_var));
    
    job_desc_count := job_desc_count + 1;
END IF;

Performance Considerations

Embedding Generation

  • Uses Snowflake Cortex EMBED_TEXT_768 (Arctic embeddings)
  • Processes during document chunking (not at query time)
  • One-time cost per chunk
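
For illustration, each chunk's vector comes from a single call of this form at insert time, so queries only need to embed the question text:

SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768(
           'snowflake-arctic-embed-m',
           'Coursework in Computer Science and Business Administration'
       ) AS chunk_embedding;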

Vector Similarity Search

  • VECTOR_COSINE_SIMILARITY is highly optimized in Snowflake
  • Minimum similarity threshold: 0.3 (adjustable)
  • Top-k retrieval: Default 5 chunks (adjustable)

Batch Processing

  • Processes documents sequentially via cursor
  • Each document committed independently
  • Failed documents don't block others (with error handling)

Query Performance

  • Average query time: Varies by model (typically 1-5 seconds)
  • Cached embeddings eliminate re-computation
  • LLM completion is primary latency factor

Limitations and Known Issues

Current Limitations

  1. No Multi-Document Correlation: Each document processed independently
  2. Fixed Embedding Model: Arctic-embed-m hardcoded (768 dimensions)
  3. Manual Document Upload: No automated file ingestion
  4. Limited Error Recovery: Failed chunks may need manual intervention
  5. No Incremental Updates: Document re-processing requires deletion and re-insertion

Snowflake-Specific Constraints

  1. UDTF Correlation: Cannot use table functions with correlated column references
  2. Nested Exception Handling: Limited support in SQL scripting
  3. Cursor Syntax: Fields of a FOR...IN record cannot be bound directly inside SQL statements; copy them into declared variables first

Future Enhancements

Planned Features

  • Job Description specialized chunking
  • Batch file upload via Snowflake stages
  • Hybrid search (keyword + semantic)
  • Multi-document synthesis queries
  • Document versioning and change tracking
  • Advanced error handling and retry logic
  • Configurable embedding models
  • Real-time processing via Snowpipe

Research Areas

  • Cross-document relationship detection
  • Automatic document type classification
  • Query result caching
  • Fine-tuned embeddings for domain-specific content

Troubleshooting

Common Issues

Issue: "Unsupported subquery type cannot be evaluated"

  • Cause: Attempting to pass column reference to table function
  • Solution: Use cursor iteration with variable binding (:variable_name)

Issue: Empty or missing chunks

  • Cause: Section detection regex not matching document format
  • Solution: Review CHUNK_RESUME_SIMPLE logic; adjust section patterns
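
One quick check (a sketch using the single-document scalar-subquery pattern shown earlier, with a hypothetical document name) is to run the chunker directly against the staged content and inspect which sections it detects:

SELECT chunk_index, section_name, LEFT(chunk_text, 80) AS preview
FROM TABLE(RAG_SCHEMA.CHUNK_RESUME_SIMPLE(
    (SELECT content FROM RAG_SCHEMA.DOCUMENT_STAGING WHERE doc_name = 'John Doe Resume')
));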

Issue: Poor query results

  • Cause: Chunks too large/small or poor semantic boundaries
  • Solution: Adjust chunk_size/overlap or use specialized chunking function

Issue: Low similarity scores

  • Cause: Vocabulary mismatch between query and documents
  • Solution: Try query reformulation or adjust similarity threshold
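
Before changing the threshold, it can help to inspect the raw scores a query produces (a sketch; substitute the failing question text):

SELECT doc_name, chunk_index,
       VECTOR_COSINE_SIMILARITY(
           embedding,
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'your question here')
       ) AS similarity
FROM RAG_SCHEMA.DOCUMENTS
ORDER BY similarity DESC
LIMIT 10;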

Contributing

When adding new document types or chunking strategies:

  1. Create dedicated function following naming convention: CHUNK_[TYPE]
  2. Add comprehensive comments explaining chunking logic
  3. Test with sample documents before production use
  4. Update this README with new document type details
  5. Add test cases validating chunking accuracy

License

This code is provided for review and educational purposes only.

Usage Restrictions:

  • This software may NOT be used in any production environment without express written permission from the author
  • Permission must be explicitly sought and granted before any production deployment
  • Review, study, and educational use is permitted
  • Modifications and derivative works require author permission for production use

For permission requests, please contact the author.

Authors

Stephen Stephan

Built during a session focused on solving semantic chunking challenges for resume documents in Snowflake RAG systems.

Acknowledgments

  • Snowflake Cortex for embeddings and LLM completion
  • Arctic embeddings model for semantic understanding
  • Community feedback on UDTF limitations and workarounds
