πŸ“š InstructDatasetScript

Generate instruction fine-tuning datasets from various document formats using LLM APIs.

Transform your PDFs, text files, and structured data into high-quality question-answer pairs for training language models.


✨ Features

  • Multi-Format Input: PDF, TXT, JSON, JSONL, CSV
  • Multiple LLM Providers: Groq, OpenAI, DeepSeek, Ollama (local)
  • Smart Chunking: Intelligent text splitting with configurable overlap
  • Batch Processing: Process documents in configurable page batches
  • Rate Limit Handling: Automatic retry with exponential backoff
  • Dual Output: Export to JSONL and/or CSV formats
  • Configurable: CLI arguments or .env file configuration

πŸš€ Quick Start

1. Clone the Repository

git clone https://github.com/kadiryonak/InstructDatasetScript.git
cd InstructDatasetScript

2. Create Virtual Environment

python -m venv .venv

# Windows
.\.venv\Scripts\activate

# Linux/macOS
source .venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment

# Copy example config
cp .env.example .env

# Edit .env and add your API key

5. Generate Dataset

python instructDataset.py --input your_document.pdf --output dataset.jsonl

βš™οΈ Configuration

Environment Variables (.env)

# API Configuration
API_KEY=your_api_key_here
API_BASE_URL=https://api.groq.com/openai/v1
MODEL_NAME=llama3-8b-8192

# Provider: groq, openai, deepseek, ollama
PROVIDER=groq
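
To sanity-check your configuration from a Python shell, a .env file like this can be loaded with python-dotenv (a quick sketch; whether config.py itself uses python-dotenv is an assumption):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print(os.getenv("PROVIDER"), os.getenv("MODEL_NAME"))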

Supported Providers

Provider   Description                   API Key Required
groq       Groq Cloud (fast inference)   βœ…
openai     OpenAI API (GPT models)       βœ…
deepseek   DeepSeek API                  βœ…
ollama     Local Ollama server           ❌
dummy      Test mode (no API calls)      ❌

πŸ“– Usage

Basic Usage

python instructDataset.py --input document.pdf --output dataset.jsonl

Advanced Usage

python instructDataset.py \
  --input data.pdf \
  --output my_dataset.jsonl \
  --provider groq \
  --model llama3-8b-8192 \
  --chunk-size 1500 \
  --overlap 200 \
  --questions 3 \
  --lang en \
  --format both

All Arguments

python instructDataset.py --help

Argument       Short  Default               Description
--input        -i     Required              Input file path
--output       -o     output_dataset.jsonl  Output file path
--provider     -p     From .env             LLM provider
--model        -m     From .env             Model name
--chunk-size          1200                  Max characters per chunk
--overlap             150                   Overlap between chunks
--batch-size          5                     Pages per batch (PDF)
--questions    -q     2                     Questions per chunk per type
--max-tokens          1024                  Max tokens for LLM response
--temperature  -t     0.7                   LLM creativity (0-1)
--lang         -l     tr                    Prompt language (tr/en)
--format       -f     jsonl                 Output format (jsonl/csv/both)
--no-metadata         False                 Exclude source metadata

πŸ“Š Hyperparameters Guide

Parameter    Description                                      Recommended Values
chunk-size   Maximum characters per text chunk                800-2000
overlap      Characters shared between adjacent chunks        100-300
batch-size   Number of PDF pages processed together           5-20
questions    Q&A pairs generated per chunk per question type  1-5
temperature  Higher = more creative, lower = more focused     0.3-0.9

Recommended Configurations

Use Case             chunk-size  overlap  questions
Short documents      800         100      2
Standard             1200        150      2
Long/detailed texts  2000        300      3
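
To see how chunk-size and overlap interact, here is a minimal character-based splitter in the spirit of the description above (an illustrative sketch, not the script's exact algorithm):

def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 150) -> list[str]:
    # Step forward by chunk_size - overlap so adjacent chunks share context;
    # overlap must stay smaller than chunk_size
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

With the defaults, each 1200-character chunk repeats the last 150 characters of its predecessor, so a sentence cut at one boundary still appears intact in one of the two chunks.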

πŸ“ Supported Formats

Input Formats

Format  Description       Parser
.pdf    PDF documents     PyMuPDF
.txt    Plain text files  UTF-8
.json   JSON files        Recursive extraction
.jsonl  JSON Lines        Line-by-line
.csv    CSV spreadsheets  Pandas

Output Format

JSONL (one object per line):

{"question": "What is the main theme of this text?", "answer": "The text discusses...", "question_type": "summary", "_source_file": "document.pdf", "_chunk_index": 0, "_chunk_excerpt": "First 200 characters..."}

CSV columns:

  • question - The generated question
  • answer - The generated answer
  • question_type - Question category (summary, factual, etc.)
  • _source_file - Source filename
  • _chunk_index - Chunk number within the source
  • _chunk_excerpt - Text excerpt from the source chunk

πŸ“ Example Output

Sample JSONL Entry

{
  "question": "What are the key characteristics of tragedy according to Aristotle?",
  "answer": "According to Aristotle, tragedy is an imitation of an action that is serious, complete, and of a certain magnitude. It uses language with artistic ornaments and involves incidents arousing pity and fear to accomplish catharsis.",
  "question_type": "summary",
  "_source_file": "aristoteles-poetika.pdf",
  "_chunk_index": 2,
  "_chunk_excerpt": "Tragedy is an imitation of an action that is serious..."
}

Question Types Generated

  • summary - Text summarization questions
  • factual - Information extraction questions
  • analysis - Deep analysis questions
  • definition - Concept definition questions
  • comparison - Comparison questions

πŸ—οΈ Project Structure

InstructDatasetScript/
β”œβ”€β”€ instructDataset.py   # Main CLI script
β”œβ”€β”€ config.py            # Configuration management
β”œβ”€β”€ providers.py         # LLM provider abstraction
β”œβ”€β”€ parsers.py           # Document parsers
β”œβ”€β”€ requirements.txt     # Python dependencies
β”œβ”€β”€ .env                 # Your config (private, git-ignored)
β”œβ”€β”€ .env.example         # Example config for users
β”œβ”€β”€ .gitignore           # Git ignore patterns
└── README.md            # This file

πŸ”§ Extending

Adding a New Provider

Edit providers.py:

class MyProvider(LLMProvider):
    def generate(self, prompt: str, max_tokens: int = 512, **kwargs) -> str:
        response_text = ...  # call your backend's API here
        return response_text

# Register in PROVIDER_REGISTRY
PROVIDER_REGISTRY["myprovider"] = MyProvider

Adding a New Format

Edit parsers.py:

class MyFormatParser(DocumentParser):
    def parse_text(self) -> str:
        extracted_text = ...  # parse the file into one plain-text string
        return extracted_text

# Register in PARSER_REGISTRY
PARSER_REGISTRY[".myext"] = MyFormatParser

⚠️ Rate Limiting

The script includes automatic retry with exponential backoff for rate limits:

  • Max retries: 8 (configurable)
  • Base delay: 3 seconds
  • Max delay: 60 seconds
  • Smart detection: Extracts wait time from API error messages
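
These defaults correspond to a retry loop roughly like the sketch below (an illustration of the schedule, not the script's literal implementation; per the "smart detection" point above, the real code additionally parses the suggested wait time out of the error message):

import time

def with_backoff(call, max_retries=8, base_delay=3.0, max_delay=60.0):
    # Double the delay after each failed attempt, capped at max_delay
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(min(base_delay * 2 ** attempt, max_delay))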

If you encounter persistent rate limits, consider:

  1. Using a model with higher limits
  2. Reducing --questions parameter
  3. Increasing --batch-size to reduce total API calls

πŸ“„ License

MIT License


🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing)
  5. Open a Pull Request

πŸ™ Acknowledgments
