Generate instruction fine-tuning datasets from various document formats using LLM APIs.
Transform your PDFs, text files, and structured data into high-quality question-answer pairs for training language models.
- Multi-Format Input: PDF, TXT, JSON, JSONL, CSV
- Multiple LLM Providers: Groq, OpenAI, DeepSeek, Ollama (local)
- Smart Chunking: Intelligent text splitting with configurable overlap
- Batch Processing: Process documents in configurable page batches
- Rate Limit Handling: Automatic retry with exponential backoff
- Dual Output: Export to JSONL and/or CSV formats
- Configurable: CLI arguments or `.env` file configuration
```bash
git clone https://github.com/kadiryonak/InstructDatasetScript.git
cd InstructDatasetScript
```

```bash
python -m venv .venv

# Windows
.\.venv\Scripts\activate

# Linux/macOS
source .venv/bin/activate
```

```bash
pip install -r requirements.txt
```

```bash
# Copy example config
cp .env.example .env
# Edit .env and add your API key
```

```bash
python instructDataset.py --input your_document.pdf --output dataset.jsonl
```
Example `.env`:

```env
# API Configuration
API_KEY=your_api_key_here
API_BASE_URL=https://api.groq.com/openai/v1
MODEL_NAME=llama3-8b-8192

# Provider: groq, openai, deepseek, ollama
PROVIDER=groq
```

| Provider | Description | API Key Required |
|---|---|---|
| `groq` | Groq Cloud (fast inference) | ✅ |
| `openai` | OpenAI API (GPT models) | ✅ |
| `deepseek` | DeepSeek API | ✅ |
| `ollama` | Local Ollama server | ❌ |
| `dummy` | Test mode (no API calls) | ❌ |
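As an example, a fully local setup against an Ollama server might use a `.env` along these lines (this assumes the script routes the `ollama` provider through the same `API_BASE_URL`/`MODEL_NAME` variables; Ollama's OpenAI-compatible endpoint lives at `/v1` and does not validate API keys):

```env
PROVIDER=ollama
API_BASE_URL=http://localhost:11434/v1
MODEL_NAME=llama3
# Any non-empty value works; local Ollama ignores the key
API_KEY=ollama
```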
Basic usage:

```bash
python instructDataset.py --input document.pdf --output dataset.jsonl
```

Full example:

```bash
python instructDataset.py \
  --input data.pdf \
  --output my_dataset.jsonl \
  --provider groq \
  --model llama3-8b-8192 \
  --chunk-size 1500 \
  --overlap 200 \
  --questions 3 \
  --lang en \
  --format both
```

See all options:

```bash
python instructDataset.py --help
```
| Argument | Short | Default | Description |
|---|---|---|---|
| `--input` | `-i` | Required | Input file path |
| `--output` | `-o` | `output_dataset.jsonl` | Output file path |
| `--provider` | `-p` | From `.env` | LLM provider |
| `--model` | `-m` | From `.env` | Model name |
| `--chunk-size` | | 1200 | Max characters per chunk |
| `--overlap` | | 150 | Overlap between chunks |
| `--batch-size` | | 5 | Pages per batch (PDF) |
| `--questions` | `-q` | 2 | Questions per chunk per type |
| `--max-tokens` | | 1024 | Max tokens for LLM response |
| `--temperature` | `-t` | 0.7 | LLM creativity (0-1) |
| `--lang` | `-l` | `tr` | Prompt language (tr/en) |
| `--format` | `-f` | `jsonl` | Output format (jsonl/csv/both) |
| `--no-metadata` | | False | Exclude source metadata |
| Parameter | Description | Recommended Values |
|---|---|---|
| chunk-size | Maximum characters per text chunk | 800-2000 |
| overlap | Characters shared between adjacent chunks | 100-300 |
| batch-size | Number of PDF pages processed together | 5-20 |
| questions | Q&A pairs generated per chunk per question type | 1-5 |
| temperature | Higher = more creative, Lower = more focused | 0.3-0.9 |
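To make the chunk-size/overlap relationship concrete, here is a minimal character-based sketch of overlapping chunking (an illustration of the idea, not the script's actual implementation):

```python
def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 150) -> list[str]:
    """Split text into chunks of at most chunk_size characters; adjacent
    chunks repeat the trailing `overlap` characters of the previous chunk."""
    assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Advance by chunk_size - overlap so context carries across boundaries
        start += chunk_size - overlap
    return chunks

# With the CLI defaults, chunk 0 covers characters 0-1199 and chunk 1
# starts at 1050, repeating the previous chunk's last 150 characters.
```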
| Use Case | chunk-size | overlap | questions |
|---|---|---|---|
| Short documents | 800 | 100 | 2 |
| Standard | 1200 | 150 | 2 |
| Long/detailed texts | 2000 | 300 | 3 |
| Format | Description | Parser |
|---|---|---|
| `.pdf` | PDF documents | PyMuPDF |
| `.txt` | Plain text files | UTF-8 |
| `.json` | JSON files | Recursive extraction |
| `.jsonl` | JSON Lines | Line-by-line |
| `.csv` | CSV spreadsheets | Pandas |
JSONL (one object per line):
{"question": "What is the main theme of this text?", "answer": "The text discusses...", "question_type": "summary", "_source_file": "document.pdf", "_chunk_index": 0, "_chunk_excerpt": "First 200 characters..."}CSV columns:
question- Questionanswer- Answerquestion_type- Question type_source_file- Source filename_chunk_index- Chunk number_chunk_excerpt- Text excerpt
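Because each line is a standalone JSON object, the JSONL output can be loaded with nothing but the standard library; for example (the file name is just an example):

```python
import json

pairs = []
with open("dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        pairs.append((record["question"], record["answer"]))

print(f"Loaded {len(pairs)} Q&A pairs")
```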
```json
{
  "question": "What are the key characteristics of tragedy according to Aristotle?",
  "answer": "According to Aristotle, tragedy is an imitation of an action that is serious, complete, and of a certain magnitude. It uses language with artistic ornaments and involves incidents arousing pity and fear to accomplish catharsis.",
  "question_type": "summary",
  "_source_file": "aristoteles-poetika.pdf",
  "_chunk_index": 2,
  "_chunk_excerpt": "Tragedy is an imitation of an action that is serious..."
}
```

Question types:
- summary - Text summarization questions
- factual - Information extraction questions
- analysis - Deep analysis questions
- definition - Concept definition questions
- comparison - Comparison questions
```
InstructDatasetScript/
├── instructDataset.py    # Main CLI script
├── config.py             # Configuration management
├── providers.py          # LLM provider abstraction
├── parsers.py            # Document parsers
├── requirements.txt      # Python dependencies
├── .env                  # Your config (private, git-ignored)
├── .env.example          # Example config for users
├── .gitignore            # Git ignore patterns
└── README.md             # This file
```
Edit `providers.py`:

```python
class MyProvider(LLMProvider):
    def generate(self, prompt: str, max_tokens: int = 512, **kwargs) -> str:
        # Make the API call and return the generated text
        return response_text

# Register in PROVIDER_REGISTRY
PROVIDER_REGISTRY["myprovider"] = MyProvider
```
Edit `parsers.py`:

```python
class MyFormatParser(DocumentParser):
    def parse_text(self) -> str:
        # Parse the file and return its text content
        return extracted_text

# Register in PARSER_REGISTRY
PARSER_REGISTRY[".myext"] = MyFormatParser
```
The script includes automatic retry with exponential backoff for rate limits:

- Max retries: 8 (configurable)
- Base delay: 3 seconds
- Max delay: 60 seconds
- Smart detection: Extracts wait time from API error messages
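A simplified sketch of this retry pattern is shown below; the defaults mirror the values above, `RateLimitError` stands in for whatever exception your provider raises, and this is not the script's exact code:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the provider-specific rate-limit exception."""

def call_with_backoff(call, max_retries=8, base_delay=3.0, max_delay=60.0):
    """Retry `call` on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            # Exponential backoff capped at max_delay, plus jitter
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, 1))
    raise RuntimeError("Rate limit retries exhausted")
```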
If you encounter persistent rate limits, consider:
- Using a model with higher limits
- Reducing the `--questions` parameter
- Increasing `--batch-size` to reduce total API calls
MIT License
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing`)
- Open a Pull Request