A production-ready Retrieval-Augmented Generation (RAG) system built on Snowflake that intelligently processes different document types using specialized chunking strategies. The system uses Snowflake Cortex for embeddings and LLM completion, with a focus on semantic accuracy and extensibility.
- Multi-Strategy Document Processing: Different chunking approaches for different document types
- Section-Based Resume Chunking: Preserves semantic context by chunking resumes by logical sections
- Vector Similarity Search: Uses Snowflake Cortex embeddings (768-dimensional Arctic embeddings)
- Type-Extensible Architecture: Easy to add new document types with custom processing strategies
- Automated Batch Processing: Cursor-based iteration handles multiple documents efficiently
- Query Analytics: Built-in logging and performance tracking
DOCUMENT_STAGING
- Temporary storage for uploaded documents before processing
- Tracks processing status and document type
- Fields: `staging_id`, `doc_name`, `doc_type`, `content`, `document_type_id`, `uploaded_at`, `processed`
DOCUMENTS
- Final storage for processed document chunks with embeddings
- Fields: `doc_id`, `doc_name`, `doc_type`, `content`, `chunk_text`, `chunk_index`, `total_chunks`, `embedding` (VECTOR), `metadata` (VARIANT), `created_at`, `updated_at`
RAG_DOCUMENT_TYPE
- Registry of document types and their processing configurations
- Fields: `typeid`, `type_name`, `type_description`, `chunking_function`, `chunk_size`, `chunk_overlap`, `is_active`, `created_at`
QUERY_LOGS
- Audit trail of all RAG queries with performance metrics
- Fields: `query_id`, `user_query`, `retrieved_chunks`, `response_text`, `execution_time_ms`, `model_used`, `created_at`
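For reference, a minimal DDL sketch of the central DOCUMENTS table, assuming column types that match the field list above (the authoritative definitions live in the /tables directory):

```sql
-- Sketch only: exact column types and defaults are assumptions, apart from the VECTOR and VARIANT fields
CREATE TABLE IF NOT EXISTS RAG_SCHEMA.DOCUMENTS (
    doc_id       NUMBER AUTOINCREMENT,
    doc_name     VARCHAR,
    doc_type     VARCHAR,
    content      VARCHAR,
    chunk_text   VARCHAR,
    chunk_index  NUMBER,
    total_chunks NUMBER,
    embedding    VECTOR(FLOAT, 768),   -- 768-dimensional Arctic embeddings
    metadata     VARIANT,              -- e.g., chunking_method, section_name, document_type
    created_at   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    updated_at   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
```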
| Type | Chunking Strategy | Function | Purpose |
|---|---|---|---|
| Resume | Section-based | CHUNK_RESUME_SIMPLE | Preserves semantic context of resume sections (Education, Experience, Skills, etc.) |
| General | Size-based | CHUNK_TEXT_PYTHON | Standard overlapping chunks for general documents |
| Job Description | TBD | CHUNK_TEXT_PYTHON (placeholder) | Ready for custom implementation |
SQL-based UDTF that intelligently chunks resumes by detecting section headers:
- Professional Summary
- Technical Skills
- Experience
- Professional Development
- Education
Returns: chunk_text, chunk_index, section_name
Key Features:
- Preserves complete sections as semantic units
- Filters out short/empty sections
- Maintains section metadata for context
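A quick way to inspect its output is to call it directly. This is a sketch that assumes the resume text is the function's only argument; note the content is supplied via a scalar subquery rather than a correlated column (see the UDTF correlation workaround below):

```sql
SELECT chunk_index, section_name, LENGTH(chunk_text) AS chunk_length
FROM TABLE(RAG_SCHEMA.CHUNK_RESUME_SIMPLE(
    (SELECT content
     FROM RAG_SCHEMA.DOCUMENT_STAGING
     WHERE doc_name = 'Stephen Stephan Resume')
))
ORDER BY chunk_index;
```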
Python UDTF for standard text chunking:
- Configurable chunk size and overlap
- Attempts to break at natural boundaries (sentences, paragraphs)
- Safety limits (max 1000 chunks per document)
Returns: chunk_text, chunk_index
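An equivalent inspection call for the general chunker, sketched under the assumption that the parameters are (content, chunk_size, chunk_overlap); check the definition in /functions for the actual signature:

```sql
SELECT chunk_index, LENGTH(chunk_text) AS chunk_length
FROM TABLE(RAG_SCHEMA.CHUNK_TEXT_PYTHON(
    (SELECT content
     FROM RAG_SCHEMA.DOCUMENT_STAGING
     WHERE doc_name = 'John Doe Resume'),
    1000,   -- chunk_size (from RAG_DOCUMENT_TYPE)
    100     -- chunk_overlap
))
ORDER BY chunk_index;
```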
Automated batch processor for multiple document types.
Algorithm:
- Query unprocessed documents with their type configuration
- Use cursor iteration to process each document individually
- Route to the appropriate chunking function based on `type_name`
- Generate vector embeddings using Snowflake Cortex
- Store chunks with metadata
- Mark document as processed
Critical Implementation Detail: Uses variable binding (:content_var) rather than column references to avoid Snowflake UDTF correlation limitations.
Returns: Processing summary with counts by document type
RAG query processor.
Algorithm:
- Generate query embedding using Cortex
- Vector similarity search to find top-k relevant chunks (threshold: 0.3)
- Aggregate chunks into context with source attribution
- Submit to Cortex LLM with context-aware prompt
- Return answer with source citations
Returns: answer, sources (array), query
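A simplified, illustrative sketch of what these steps look like as SQL; the actual procedure builds the prompt and parameters dynamically, so the names and prompt wording here are assumptions:

```sql
WITH scored AS (
    SELECT doc_name, chunk_index, chunk_text,
           VECTOR_COSINE_SIMILARITY(
               embedding,
               SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m',
                                               'What is the candidate''s education background?')
           ) AS similarity
    FROM RAG_SCHEMA.DOCUMENTS
),
top_chunks AS (
    SELECT doc_name, chunk_index, chunk_text
    FROM scored
    WHERE similarity >= 0.3            -- minimum similarity threshold
    ORDER BY similarity DESC
    LIMIT 5                            -- top_k
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
           'llama3-8b',
           'Answer the question using only the context below.\n\nContext:\n' ||
           LISTAGG('[' || doc_name || ' chunk ' || chunk_index::STRING || '] ' || chunk_text, '\n---\n') ||
           '\n\nQuestion: What is the candidate''s education background?'
       ) AS answer
FROM top_chunks;
```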
Production wrapper around ASK_QUESTION with logging.
Features:
- Performance timing
- Query logging to the `QUERY_LOGS` table
- Error handling
Returns: answer, sources
DOCUMENT_STATS
- Summary statistics per document
- Chunk counts, average chunk length
- First indexed and last updated timestamps
QUERY_ANALYTICS
- Daily query aggregations
- Average response times
- Chunks retrieved per query
- Breakdown by model used
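A rough sketch of what these views could look like, assuming the column names listed in the schema section (the shipped definitions in /views are authoritative):

```sql
CREATE OR REPLACE VIEW RAG_SCHEMA.DOCUMENT_STATS AS
SELECT doc_name,
       COUNT(*)                AS chunk_count,
       AVG(LENGTH(chunk_text)) AS avg_chunk_length,
       MIN(created_at)         AS first_indexed,
       MAX(updated_at)         AS last_updated
FROM RAG_SCHEMA.DOCUMENTS
GROUP BY doc_name;

CREATE OR REPLACE VIEW RAG_SCHEMA.QUERY_ANALYTICS AS
SELECT DATE_TRUNC('day', created_at)::DATE AS query_date,
       model_used,
       COUNT(*)               AS query_count,
       AVG(execution_time_ms) AS avg_execution_time_ms,
       AVG(retrieved_chunks)  AS avg_chunks_retrieved   -- assumes retrieved_chunks stores a count
FROM RAG_SCHEMA.QUERY_LOGS
GROUP BY query_date, model_used;
```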
Problem: Snowflake does not support correlated table functions. This pattern fails:
SELECT chunk_text
FROM document_staging s,
TABLE(CHUNK_FUNCTION(s.content)) chunks -- ❌ FAILS
WHERE s.doc_name = 'example';

Error: "Unsupported subquery type cannot be evaluated"
Solution: Use cursor-based iteration with variable binding:
FOR doc IN cursor DO
INSERT INTO documents
SELECT ...
FROM TABLE(CHUNK_FUNCTION(:doc.content)) -- ✅ WORKS
END FOR;

For processing a single known document, use a scalar subquery:
INSERT INTO documents (...)
WITH chunked_data AS (
SELECT
chunks.*,
COUNT(*) OVER () as total_chunks
FROM TABLE(CHUNK_RESUME_SIMPLE(
(SELECT content FROM document_staging WHERE doc_name = 'Target Doc')
)) chunks
CROSS JOIN (SELECT doc_name, doc_type, content
FROM document_staging
WHERE doc_name = 'Target Doc') s
)
SELECT ... FROM chunked_data;

Note: The test scenarios below use the author's actual resume data. When testing with your own documents, you'll get different results based on your content, but the structure and process will be the same.
Objective: Verify education information is preserved as a complete semantic unit
Test Data: Resume for "Stephen Stephan" containing:
- Jones College coursework (Computer Science, Business Administration)
- Florida State College at Jacksonville coursework (Computer Programming, Information Systems)
- Terry Parker High School diploma
Test Query: "What is Stephen's education background?"
Test Execution:
CALL RAG_SCHEMA.ASK_QUESTION('What is Stephen''s education background?', 5, 'llama3-8b');

Expected Result: Complete education history returned in context
Actual Result: ✅ PASS
Answer: According to the provided context, Stephen Stephan's education background is as follows:
* Jones College, Jacksonville, FL: Coursework in Computer Science and Business Administration
* Florida State College at Jacksonville, Jacksonville, FL: Coursework in Computer Programming and Information Systems
* Terry Parker High School, Jacksonville, FL: High School Diploma
Sources:
- Chunk 3: similarity 0.686
- Chunk 2: similarity 0.642
- Chunk 5: similarity 0.583
- Chunk 6: similarity 0.419 (Education section)
- Chunk 4: similarity 0.395
Sources:
[
{
"chunk": 3,
"document": "Stephen Stephan Resume",
"similarity": 0.686
},
{
"chunk": 2,
"document": "Stephen Stephan Resume",
"similarity": 0.642
},
{
"chunk": 5,
"document": "Stephen Stephan Resume",
"similarity": 0.583
},
{
"chunk": 6,
"document": "Stephen Stephan Resume",
"similarity": 0.419
},
{
"chunk": 4,
"document": "Stephen Stephan Resume",
"similarity": 0.395
}
]

Validation:
- All three educational institutions retrieved
- Complete coursework details included
- High similarity scores (0.419-0.686)
- Education section (chunk 6) included in top 5 results
Objective: Verify automated processor handles type-specific chunking
Test Setup:
- Reset document processing status:
  UPDATE DOCUMENT_STAGING SET processed = FALSE WHERE doc_name = 'Stephen Stephan Resume';
  DELETE FROM DOCUMENTS WHERE doc_name = 'Stephen Stephan Resume';
- Execute automated processor:
  CALL PROCESS_STAGED_DOCUMENTS_TYPED();
Expected Result: Resume processed with section-based chunking, marked as processed
Actual Result: ✅ PASS
Return: "Processed 1 resumes and 0 general documents"
Verification:
- 5 chunks created (indices 2-6)
- Sections: EXPERIENCE, TECHNICAL_SKILLS, PROFESSIONAL_DEVELOPMENT, EDUCATION
- Document marked processed = TRUE
- document_type_id = 2 (Resume)
Objective: Verify chunk metadata and structure
Query:
SELECT
chunk_index,
metadata:section_name::STRING as section,
total_chunks,
LENGTH(chunk_text) as chunk_length,
SUBSTRING(chunk_text, 1, 100) as preview
FROM DOCUMENTS
WHERE doc_name = 'Stephen Stephan Resume'
ORDER BY chunk_index;

Results: ✅ PASS
- 5 chunks with consistent `total_chunks = 5`
- Section names properly stored in metadata
- Chunk lengths: 271-2247 characters
- All sections represented: EXPERIENCE (2x), TECHNICAL_SKILLS, PROFESSIONAL_DEVELOPMENT, EDUCATION
Objective: Validate complete pipeline from staging to query response
Test Flow:
- Document upload → `DOCUMENT_STAGING`
- Automated processing → Type detection → Section-based chunking
- Embedding generation → `DOCUMENTS` table with vectors
- Semantic query → Vector similarity search → LLM completion
- Query logging → `QUERY_LOGS`

Result: ✅ PASS - Complete pipeline functional with accurate results
| Test Case | Status | Notes |
|---|---|---|
| Section-based chunking | ✅ PASS | Education preserved as semantic unit |
| Automated processing | ✅ PASS | Type-specific routing works |
| Vector similarity search | ✅ PASS | High relevance scores (0.4-0.7) |
| Metadata preservation | ✅ PASS | Section names stored correctly |
| Query accuracy | ✅ PASS | Complete education info retrieved |
| Processing status tracking | ✅ PASS | Documents marked processed |
| End-to-end pipeline | ✅ PASS | All components integrated |
-- Create schema
CREATE SCHEMA IF NOT EXISTS RAG_SCHEMA;
-- Create tables (see schema files in /tables directory)
-- Execute in order: RAG_DOCUMENT_TYPE, DOCUMENT_STAGING, DOCUMENTS, QUERY_LOGS
-- Create initial document types
INSERT INTO RAG_SCHEMA.RAG_DOCUMENT_TYPE
(TYPE_NAME, TYPE_DESCRIPTION, CHUNKING_FUNCTION, CHUNK_SIZE, CHUNK_OVERLAP)
VALUES
('General', 'All Purpose RAG', 'CHUNK_TEXT_PYTHON', 1000, 100),
('Resume', 'Complete Resume form', 'CHUNK_RESUME_SIMPLE', NULL, NULL),
('Job Description', 'Job descriptions', 'CHUNK_TEXT_PYTHON', 800, 50);

-- Create chunking functions (see /functions directory)
-- Execute: CHUNK_TEXT_PYTHON.sql, CHUNK_RESUME_SIMPLE.sql

-- Create processing procedures (see /procedures directory)
-- Execute: PROCESS_STAGED_DOCUMENTS_TYPED.sql, ASK_QUESTION.sql, EXECUTE_RAG.sql

-- Create analytics views (see /views directory)
-- Execute: DOCUMENT_STATS.sql, QUERY_ANALYTICS.sql

-- 1. Upload document to staging
INSERT INTO RAG_SCHEMA.DOCUMENT_STAGING (doc_name, doc_type, content, document_type_id)
VALUES (
'John Doe Resume',
'Resume',
'... resume content ...',
(SELECT typeid FROM RAG_SCHEMA.RAG_DOCUMENT_TYPE WHERE type_name = 'Resume')
);
-- 2. Process all staged documents
CALL RAG_SCHEMA.PROCESS_STAGED_DOCUMENTS_TYPED();
-- 3. Verify processing
SELECT * FROM RAG_SCHEMA.DOCUMENT_STATS WHERE doc_name = 'John Doe Resume';

-- Simple query
CALL RAG_SCHEMA.EXECUTE_RAG(
'What is the candidate''s education background?',
5, -- top_k chunks to retrieve
'llama3-8b' -- Cortex model
);
-- Advanced query with custom parameters
CALL RAG_SCHEMA.ASK_QUESTION(
'What programming languages does the candidate know?',
10,
'claude-3-sonnet'
);

-- View document statistics
SELECT * FROM RAG_SCHEMA.DOCUMENT_STATS
ORDER BY last_updated DESC;
-- View query analytics
SELECT * FROM RAG_SCHEMA.QUERY_ANALYTICS
WHERE query_date >= CURRENT_DATE - 7
ORDER BY query_date DESC;
-- Recent queries
SELECT
user_query,
retrieved_chunks,
execution_time_ms,
model_used,
created_at
FROM RAG_SCHEMA.QUERY_LOGS
ORDER BY created_at DESC
LIMIT 10;

CREATE OR REPLACE FUNCTION RAG_SCHEMA.CHUNK_JOB_DESCRIPTION(content VARCHAR)
RETURNS TABLE (chunk_text VARCHAR, chunk_index NUMBER, section_name VARCHAR)
LANGUAGE SQL
AS $$
-- Custom logic for job description chunking
-- e.g., by Requirements, Responsibilities, Qualifications sections
$$;

INSERT INTO RAG_SCHEMA.RAG_DOCUMENT_TYPE
(TYPE_NAME, TYPE_DESCRIPTION, CHUNKING_FUNCTION, CHUNK_SIZE, CHUNK_OVERLAP)
VALUES (
'Job Description',
'Job postings and descriptions',
'CHUNK_JOB_DESCRIPTION',
800,
50
);

Add new ELSIF branch in PROCESS_STAGED_DOCUMENTS_TYPED:
ELSIF (type_name_var = 'Job Description') THEN
SELECT COUNT(*) INTO total_chunks_var
FROM TABLE(RAG_SCHEMA.CHUNK_JOB_DESCRIPTION(:content_var));
INSERT INTO documents (...)
SELECT
:doc_name_var,
:doc_type_var,
:content_var,
chunk_text,
chunk_index,
:total_chunks_var,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk_text),
OBJECT_CONSTRUCT('chunking_method', 'job_description', 'section_name', section_name, 'document_type', 'Job Description')
FROM TABLE(RAG_SCHEMA.CHUNK_JOB_DESCRIPTION(:content_var));
job_desc_count := job_desc_count + 1;
END IF;

- Uses Snowflake Cortex `EMBED_TEXT_768` (Arctic embeddings)
- Processes during document chunking (not at query time)
- One-time cost per chunk
- `VECTOR_COSINE_SIMILARITY` is highly optimized in Snowflake
- Minimum similarity threshold: 0.3 (adjustable)
- Top-k retrieval: Default 5 chunks (adjustable)
- Processes documents sequentially via cursor
- Each document committed independently
- Failed documents don't block others (with error handling)
- Average query time: Varies by model (typically 1-5 seconds)
- Cached embeddings eliminate re-computation
- LLM completion is primary latency factor
- No Multi-Document Correlation: Each document processed independently
- Fixed Embedding Model: Arctic-embed-m hardcoded (768 dimensions)
- Manual Document Upload: No automated file ingestion
- Limited Error Recovery: Failed chunks may need manual intervention
- No Incremental Updates: Document re-processing requires deletion and re-insertion
- UDTF Correlation: Cannot use table functions with correlated column references
- Nested Exception Handling: Limited support in SQL scripting
- Cursor Syntax: `FOR...IN` with direct field access not supported (must use variable assignment); see the sketch below
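A minimal illustration of the variable-assignment pattern referenced in the cursor syntax note above (variable names are hypothetical; the shipped procedures add type routing, embeddings, metadata, and error handling):

```sql
FOR doc IN doc_cursor DO
    LET doc_name_var STRING := doc.doc_name;   -- copy cursor fields into local variables first
    LET content_var  STRING := doc.content;

    INSERT INTO RAG_SCHEMA.DOCUMENTS (doc_name, chunk_text, chunk_index)
    SELECT :doc_name_var, chunk_text, chunk_index
    FROM TABLE(RAG_SCHEMA.CHUNK_RESUME_SIMPLE(:content_var));   -- bind variables, not columns
END FOR;
```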
- Job Description specialized chunking
- Batch file upload via Snowflake stages
- Hybrid search (keyword + semantic)
- Multi-document synthesis queries
- Document versioning and change tracking
- Advanced error handling and retry logic
- Configurable embedding models
- Real-time processing via Snowpipe
- Cross-document relationship detection
- Automatic document type classification
- Query result caching
- Fine-tuned embeddings for domain-specific content
Issue: "Unsupported subquery type cannot be evaluated"
- Cause: Attempting to pass column reference to table function
- Solution: Use cursor iteration with variable binding (`:variable_name`)
Issue: Empty or missing chunks
- Cause: Section detection regex not matching document format
- Solution: Review `CHUNK_RESUME_SIMPLE` logic; adjust section patterns
Issue: Poor query results
- Cause: Chunks too large/small or poor semantic boundaries
- Solution: Adjust chunk_size/overlap or use specialized chunking function
Issue: Low similarity scores
- Cause: Vocabulary mismatch between query and documents
- Solution: Try query reformulation or adjust similarity threshold
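When chasing low similarity scores, it can help to inspect the raw scores directly. A hypothetical diagnostic query (swap in a question that matches your content):

```sql
-- Shows how well stored chunks match a test question, without the 0.3 threshold applied
SELECT doc_name, chunk_index,
       metadata:section_name::STRING AS section,
       VECTOR_COSINE_SIMILARITY(
           embedding,
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m',
                                           'What is the candidate''s education background?')
       ) AS similarity
FROM RAG_SCHEMA.DOCUMENTS
ORDER BY similarity DESC
LIMIT 10;
```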
When adding new document types or chunking strategies:
- Create dedicated function following naming convention `CHUNK_[TYPE]`
- Add comprehensive comments explaining chunking logic
- Test with sample documents before production use
- Update this README with new document type details
- Add test cases validating chunking accuracy
This code is provided for review and educational purposes only.
Usage Restrictions:
- This software may NOT be used in any production environment without express written permission from the author
- Permission must be explicitly sought and granted before any production deployment
- Review, study, and educational use is permitted
- Modifications and derivative works require author permission for production use
For permission requests, please contact the author.
Stephen Stephan
Built during a session focused on solving semantic chunking challenges for resume documents in Snowflake RAG systems.
- Snowflake Cortex for embeddings and LLM completion
- Arctic embeddings model for semantic understanding
- Community feedback on UDTF limitations and workarounds