This repository accompanies the course Preprocessing Unstructured Data for LLM Applications, created by DeepLearning.AI in collaboration with Unstructured.
The course teaches how to transform messy, real-world documents into structured, high-quality datasets that can be effectively used for LLM-powered applications such as Retrieval-Augmented Generation (RAG), chatbots, and knowledge assistants.
Large Language Models (LLMs) are powerful but rely heavily on high-quality input data. Real-world data (PDFs, images, HTML pages, Word docs) often comes in messy, unstructured formats. Key concepts:
- Unstructured Data: Free-form data (text, images, tables, scanned docs) that lacks a consistent schema.
- Why Preprocessing Matters:
  - Removes noise and inconsistencies.
  - Ensures inputs are machine-readable.
  - Preserves context and meaning for better model performance.
- LLM Applications That Rely on Clean Data:
  - Question-answering bots.
  - Document summarizers.
  - Knowledge retrieval systems.
  - Enterprise RAG pipelines.
In essence, preprocessing bridges the gap between raw, messy inputs and LLM-ready structured data.
Unstructured text often has inconsistencies: multiple encodings, extra whitespace, broken Unicode characters, or mismatched formats. Normalization ensures consistency across documents before further processing.
Key techniques:
- Encoding Fixes: Converting everything to UTF-8 for uniformity.
- Cleaning Noise: Removing escape sequences, non-printable characters, or OCR errors.
- Standardizing Case & Whitespace: Useful for uniform tokenization.
- Removing Boilerplate: Headers, footers, and irrelevant text (common in PDFs/HTML).
Outcome: A normalized, uniform text representation that makes downstream processing (like chunking or metadata tagging) reliable.
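A minimal normalization sketch using only the Python standard library; the function name and the specific cleanup steps are illustrative, not part of the course materials:

```python
import re
import unicodedata

def normalize_text(raw: str) -> str:
    """Normalize raw extracted text before chunking or metadata tagging."""
    # Fold Unicode variants (ligatures, full-width chars) into a canonical form.
    text = unicodedata.normalize("NFKC", raw)
    # Drop non-printable control characters left over from extraction,
    # keeping newlines and tabs.
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Collapse runs of spaces/tabs, and cap blank lines at one.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```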
LLMs perform poorly if fed entire large documents. Instead, we split documents into semantically meaningful chunks and attach metadata for context.
- Metadata Extraction: Captures document attributes.
  - Examples: title, author, date, source URL, section headers, file type.
  - Benefits: Helps filtering and ranking results during retrieval.
- Chunking: Dividing text into smaller, meaningful units (see the sketch after this list).
  - Naive chunking: Fixed-length windows (e.g., 500 tokens).
  - Semantic chunking: Splits aligned with natural boundaries (paragraphs, sections).
  - Hybrid approaches: Combine token-based and semantic splitting.
- Why Chunking Matters:
  - Prevents LLM context overflow.
  - Improves retrieval granularity.
  - Enhances accuracy of answers in RAG systems.
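A short sketch of paragraph-aligned chunking with metadata carried on each chunk; the word budget is a rough stand-in for a real tokenizer, and all names here are illustrative:

```python
def chunk_by_paragraph(text: str, meta: dict, max_words: int = 350) -> list[dict]:
    """Greedily pack paragraphs into chunks under a word budget,
    attaching document metadata to every chunk."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        # Close the current chunk when the next paragraph would overflow it.
        if current and count + words > max_words:
            chunks.append({"text": "\n\n".join(current), **meta})
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append({"text": "\n\n".join(current), **meta})
    return chunks

# Usage: each chunk keeps its provenance for filtering at retrieval time.
doc = "First paragraph about revenue.\n\nSecond paragraph about costs."
chunks = chunk_by_paragraph(doc, {"title": "Q3 Report", "source": "report.pdf"})
```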
Real-world documents are often PDFs (text-based or scanned) and images (screenshots, scanned contracts, receipts). Extracting information from them requires specialized tools.
- Text-based PDFs: Direct parsing with libraries (e.g., `pdfminer`, `PyPDF2`, `unstructured`).
- Scanned PDFs & Images: Require OCR (Optical Character Recognition) with tools like `Tesseract` or Azure OCR.
- Challenges:
  - Misaligned text flow.
  - Columns/tables embedded as images.
  - Poor-quality scans.
- Best Practices:
  - Detect document type before processing.
  - Use hybrid extraction (text + OCR), as sketched after this list.
  - Validate extracted content with metadata checks.
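A minimal hybrid-extraction sketch: try the PDF's embedded text layer first and fall back to OCR for scanned pages. It assumes `pdfminer.six`, `pdf2image`, and `pytesseract` are installed (plus the Tesseract binary and Poppler for rasterization):

```python
from pdfminer.high_level import extract_text

def extract_pdf_text(path: str) -> str:
    # Text-based PDF: the embedded text layer is the cleanest source.
    text = extract_text(path)
    if text and text.strip():
        return text
    # No usable text layer, so treat it as a scan: rasterize pages and OCR them.
    from pdf2image import convert_from_path
    import pytesseract
    pages = convert_from_path(path)
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)
```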
Tables often contain critical structured information (financial statements, experiment results, product catalogs), but they're notoriously hard to parse correctly.
- Challenges in Table Extraction:
  - Inconsistent layouts.
  - Nested headers.
  - Scanned vs. digital tables.
- Extraction Techniques:
  - Rule-based parsing with `pandas.read_html` or `camelot` (for PDFs).
  - OCR + table detection models (for images/scans).
  - Using `unstructured` to preserve table boundaries.
- Preserving Semantics: Store tables as structured JSON or DataFrames rather than plain text, so LLMs can understand relationships between rows/columns (see the example after this list).
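A small example of rule-based HTML table parsing with pandas, serialized to JSON records so every cell keeps its column label; the table itself is made up, and `read_html` needs `lxml` or `html5lib` installed:

```python
from io import StringIO
import pandas as pd

html = """
<table>
  <tr><th>Product</th><th>Q1</th><th>Q2</th></tr>
  <tr><td>Widget</td><td>120</td><td>150</td></tr>
  <tr><td>Gadget</td><td>80</td><td>95</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds.
df = pd.read_html(StringIO(html))[0]

# JSON records keep row/column relationships explicit for the LLM,
# unlike a flattened blob of cell text.
print(df.to_json(orient="records"))
# [{"Product":"Widget","Q1":120,"Q2":150},{"Product":"Gadget","Q1":80,"Q2":95}]
```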
The final project brings everything together: turning unstructured documents into a Retrieval-Augmented Generation (RAG) chatbot.
Pipeline overview:
1. Ingestion & Preprocessing
   - Normalize, clean, and extract text + metadata.
   - Split into chunks.
   - Handle PDFs, images, and tables as needed.
2. Embedding & Indexing
   - Convert chunks into embeddings using models like OpenAI's `text-embedding-ada-002` or local alternatives.
   - Store embeddings in a vector database (e.g., Pinecone, Weaviate, FAISS), as in the sketch after this list.
3. Retrieval
   - Query user input against the vector store.
   - Retrieve the most relevant chunks.
4. LLM Response Generation
   - Feed retrieved context + user query into an LLM.
   - Generate a context-aware response.
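A compact sketch of steps 2 and 3 using the OpenAI Python SDK and FAISS; the chunks and query are placeholders, and any embedding model could stand in for `text-embedding-ada-002`:

```python
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data], dtype="float32")

# Step 2: embed preprocessed chunks and index them for similarity search.
chunks = ["...preprocessed chunk one...", "...preprocessed chunk two..."]
vectors = embed(chunks)
index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search, fine for small corpora
index.add(vectors)

# Step 3: embed the user query and retrieve the closest chunks.
query_vec = embed(["What does the contract say about renewal?"])
_, ids = index.search(query_vec, 2)
context = "\n\n".join(chunks[i] for i in ids[0])
# `context` plus the user query would then go to the LLM (step 4).
```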
By the end, learners can build a production-ready RAG system that makes unstructured documents queryable through natural language.
This course was made possible through the collaboration between:
- DeepLearning.AI, for designing and delivering high-quality AI education.
- Unstructured, for providing cutting-edge open-source tools to process and structure unstructured data.
- The broader open-source community, whose libraries and frameworks form the foundation for practical preprocessing pipelines.
Special thanks to the instructors, engineers, and contributors who shaped this learning experience and made complex data preprocessing concepts accessible to learners worldwide.