Smart text chunking tool for RAG systems. Splits long texts into sentence-based chunks with ~10%-15% overlap for better context retention. Runs fully in-browser with a clean UI and copyable outputs.

alienveryilmaz/RAG-text-splitter-document-chunking-tool

🧩 Text Chunker: Smart Text Chunking for RAG

Text Chunker is a lightweight, browser-based tool that splits long texts into sentence-aware chunks with configurable overlap, a good fit for RAG pipelines, embeddings, and vector database ingestion.
Everything runs locally in your browser: no server, no setup, no dependencies.


✨ Features

  • ✂️ Sentence-Aware Chunking: Splits text at sentence boundaries for cleaner, context-aware results.
  • 🔗 Configurable Overlap: Preserves context between chunks (default 12%).
  • 📊 Detailed Stats: Displays total word count, chunk count, and average chunk size.
  • 📋 One-Click Copy: Instantly copy any chunk to your clipboard.
  • 💻 100% Client-Side: Works fully offline; no backend required.
  • 🎨 Modern UI: Clean, responsive design built with pure HTML, CSS, and JavaScript.

🧠 How It Works

  1. Sentence Splitting
    The text is scanned for sentence-ending punctuation (., !, ?, ।, 。) and split accordingly.

  2. Balanced Chunking
    Sentences are grouped dynamically to balance word counts across chunks without exceeding the maximum limit.

  3. Context Overlap
    Each chunk (except the first) includes a small portion (~12%) of the previous chunk's tail sentences to maintain semantic continuity, which is useful for RAG, LLMs, or embedding generation.
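
The three steps above can be sketched roughly as follows. This is a minimal illustration and not the tool's actual source: the function names (`countWords`, `splitSentences`, `groupSentences`, `addOverlap`) are hypothetical, the balancing rule (aim for an even per-chunk word target) is an assumption, and the overlap here ignores the flexibility setting.

```javascript
// Illustrative sketch only: function names are hypothetical, not the tool's API.

// Count whitespace-separated words.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Step 1: split after runs of sentence-ending punctuation (., !, ?, ।, 。).
// Naive on purpose: abbreviations such as "e.g." also end a sentence here.
function splitSentences(text) {
  const parts = text.match(/[^.!?।。]+[.!?।。]+|[^.!?।。]+$/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Step 2: group sentences toward an even per-chunk target, never above maxWords
// (a single sentence longer than maxWords still becomes its own chunk).
function groupSentences(sentences, maxWords) {
  const total = sentences.reduce((sum, s) => sum + countWords(s), 0);
  const chunkCount = Math.max(1, Math.ceil(total / maxWords));
  const target = Math.ceil(total / chunkCount);
  const chunks = [];
  let current = [];
  let words = 0;
  for (const sentence of sentences) {
    const n = countWords(sentence);
    if (current.length > 0 && (words + n > maxWords || words >= target)) {
      chunks.push(current);
      current = [];
      words = 0;
    }
    current.push(sentence);
    words += n;
  }
  if (current.length > 0) chunks.push(current);
  return chunks; // array of sentence arrays
}

// Step 3: prefix each chunk after the first with tail sentences of the previous
// chunk, taking whole sentences until about overlapPercent of its words are covered.
function addOverlap(chunks, overlapPercent) {
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk.join(" ");
    const prev = chunks[i - 1];
    const budget = (countWords(prev.join(" ")) * overlapPercent) / 100;
    const tail = [];
    let words = 0;
    for (let j = prev.length - 1; j >= 0 && words < budget; j--) {
      tail.unshift(prev[j]);
      words += countWords(prev[j]);
    }
    return [...tail, ...chunk].join(" ");
  });
}
```

Calling `addOverlap(groupSentences(splitSentences(text), 400), 12)` would then return an array of chunk strings using the default settings. Because only whole sentences are copied, the actual overlap can land somewhat above the requested percentage.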


⚙️ Configuration

All settings can be adjusted from the app interface (⚙️ Configuration Settings) or directly via code.

| Setting | Default | Description |
| --- | --- | --- |
| Max Chunk Size (words) | 400 | Maximum word count per chunk. |
| Overlap Percentage | 12 | Percentage of the previous chunk's words to overlap. |
| Overlap Flexibility | 1.5 | Lets the overlap extend up to 1.5× its target size to include full sentences. |
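
To make the interaction between these settings concrete, here is a small sketch of the arithmetic. It assumes one plausible reading of Overlap Flexibility (the overlap aims for the percentage target and may grow up to 1.5× that target to finish on a whole sentence); the property and function names are hypothetical and may not match the tool's code.

```javascript
// Defaults taken from the settings table; the property names are hypothetical.
const DEFAULT_SETTINGS = {
  maxChunkWords: 400,
  overlapPercent: 12,
  overlapFlexibility: 1.5,
};

// Target and ceiling (in words) for the overlap taken from a previous chunk.
function overlapBudget(prevChunkWords, settings = DEFAULT_SETTINGS) {
  const target = Math.round((prevChunkWords * settings.overlapPercent) / 100);
  const ceiling = Math.round(target * settings.overlapFlexibility);
  return { target, ceiling };
}

// Pick whole tail sentences: keep adding while under the target,
// but stop before a sentence that would push the overlap past the ceiling.
function pickOverlapSentences(prevSentences, settings = DEFAULT_SETTINGS) {
  const count = (s) => s.trim().split(/\s+/).filter(Boolean).length;
  const prevWords = prevSentences.reduce((sum, s) => sum + count(s), 0);
  const { target, ceiling } = overlapBudget(prevWords, settings);
  const picked = [];
  let words = 0;
  for (let i = prevSentences.length - 1; i >= 0 && words < target; i--) {
    const n = count(prevSentences[i]);
    if (words + n > ceiling) break;
    picked.unshift(prevSentences[i]);
    words += n;
  }
  return picked;
}
```

With the defaults, a full 400-word chunk gives a target of 48 overlap words and a ceiling of 72, so a tail sentence is skipped only if including it would take the overlap past 72 words.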

🚀 Quick Start

  1. Clone this repository:
    git clone https://github.com/<your-username>/<your-repo-name>.git
    cd <your-repo-name>
    
  2. Open the file textChunkerRagTool.html in your browser.
  3. Paste your text, adjust settings if needed, and click "Chunk Text".
  4. Copy chunks easily using the 📋 Copy buttons.

📊 Interface Overview

  • 📥 Input Section: Paste or write the text you want to chunk.
  • 📈 Stats Panel: Displays total word count, chunk count, and average size.
  • 📤 Chunked Output: Lists each chunk with overlap information and a copy button.
  • ⚙️ Settings Panel: Configure chunk size and overlap interactively.
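
As an illustration of what the Stats Panel reports, the figures could be derived along these lines. This is a hedged sketch: `chunkStats` and its field names are hypothetical, and it assumes the average is taken over the final chunks, overlap included, which the source does not specify.

```javascript
// Hypothetical helper: summarises the source text and its chunks.
function chunkStats(sourceText, chunks) {
  const countWords = (s) => s.trim().split(/\s+/).filter(Boolean).length;
  const chunkWords = chunks.map(countWords);
  const chunkWordSum = chunkWords.reduce((sum, n) => sum + n, 0);
  return {
    totalWords: countWords(sourceText), // words in the original input
    chunkCount: chunks.length,
    // Assumption: average over chunk sizes, so overlapped words are counted.
    averageChunkWords:
      chunks.length === 0 ? 0 : Math.round(chunkWordSum / chunks.length),
  };
}
```

Because overlapped sentences appear in two chunks, the average computed this way can be higher than the total word count divided by the number of chunks.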

🔒 Privacy

  • All processing occurs entirely in your browser.
  • No data is sent to external servers, so the tool is safe for confidential or private text.

🧩 Tech Stack

  • HTML5
  • CSS3
  • Vanilla JavaScript

No external dependencies or frameworks are required. Works on all modern browsers.

🗺️ Roadmap Ideas

  • 🧾 Export chunks as JSON / TXT
  • 🧠 Add token-based chunking (e.g., using tiktoken)
  • 🌍 Multilingual sentence detection
  • 📂 Drag-and-drop file input (PDF/DOCX via client-side parsing)
  • 🔍 Semantic similarity visualization between chunks

🀝 Contributing

  • Pull requests and feature ideas are welcome!
  • Please keep the project lightweight and dependency-free.
  • If you submit UI changes, include a short before/after example or screenshot.

📄 License

  • This project is licensed under the MIT License.
  • You are free to use, modify, and distribute it for both personal and commercial purposes.

👨‍💻 Author

  • Developed by Ali Enver Yılmaz (me)
  • A simple yet powerful open-source tool for developers working with RAG, LLMs, and NLP pipelines who need fast and reliable text segmentation.
