Generate instruction fine-tuning datasets from various document formats using LLM APIs.
Transform your PDFs, text files, and structured data into high-quality question-answer pairs for training language models.
- Multi-Format Input: PDF, TXT, JSON, JSONL, CSV
- Multiple LLM Providers: Groq, OpenAI, DeepSeek, Ollama (local)
- Smart Chunking: Intelligent text splitting with configurable overlap
- Batch Processing: Process documents in configurable page batches
- Rate Limit Handling: Automatic retry with exponential backoff
- Dual Output: Export to JSONL and/or CSV formats
- Configurable: CLI arguments or `.env` file configuration
```bash
git clone https://github.com/kadiryonak/InstructDatasetScript.git
cd InstructDatasetScript
```

```bash
python -m venv .venv

# Windows
.\.venv\Scripts\activate

# Linux/macOS
source .venv/bin/activate
```

```bash
pip install -r requirements.txt
```

```bash
# Copy example config
cp .env.example .env
# Edit .env and add your API key
```

```bash
python instructDataset.py --input your_document.pdf --output dataset.jsonl
```
Example `.env`:

```env
# API Configuration
API_KEY=your_api_key_here
API_BASE_URL=https://api.groq.com/openai/v1
MODEL_NAME=llama3-8b-8192

# Provider: groq, openai, deepseek, ollama
PROVIDER=groq
```

| Provider | Description | API Key Required |
|---|---|---|
| `groq` | Groq Cloud (fast inference) | ✅ |
| `openai` | OpenAI API (GPT models) | ✅ |
| `deepseek` | DeepSeek API | ✅ |
| `ollama` | Local Ollama server | ❌ |
| `dummy` | Test mode (no API calls) | ❌ |
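As an example, a fully local setup against an Ollama server might use a `.env` along these lines (this assumes the script routes the `ollama` provider through the same `API_BASE_URL`/`MODEL_NAME` variables; Ollama's OpenAI-compatible endpoint lives at `/v1` and does not validate API keys):

```env
PROVIDER=ollama
API_BASE_URL=http://localhost:11434/v1
MODEL_NAME=llama3
# Any non-empty value works; local Ollama ignores the key
API_KEY=ollama
```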
Basic usage:

```bash
python instructDataset.py --input document.pdf --output dataset.jsonl
```

Full example:

```bash
python instructDataset.py \
  --input data.pdf \
  --output my_dataset.jsonl \
  --provider groq \
  --model llama3-8b-8192 \
  --chunk-size 1500 \
  --overlap 200 \
  --questions 3 \
  --lang en \
  --format both
```

See all options:

```bash
python instructDataset.py --help
```
| Argument | Short | Default | Description |
|---|---|---|---|
| `--input` | `-i` | Required | Input file path |
| `--output` | `-o` | `output_dataset.jsonl` | Output file path |
| `--provider` | `-p` | From `.env` | LLM provider |
| `--model` | `-m` | From `.env` | Model name |
| `--chunk-size` | | 1200 | Max characters per chunk |
| `--overlap` | | 150 | Overlap between chunks |
| `--batch-size` | | 5 | Pages per batch (PDF) |
| `--questions` | `-q` | 2 | Questions per chunk per type |
| `--max-tokens` | | 1024 | Max tokens for LLM response |
| `--temperature` | `-t` | 0.7 | LLM creativity (0-1) |
| `--lang` | `-l` | `tr` | Prompt language (tr/en) |
| `--format` | `-f` | `jsonl` | Output format (jsonl/csv/both) |
| `--no-metadata` | | False | Exclude source metadata |
| Parameter | Description | Recommended Values |
|---|---|---|
| chunk-size | Maximum characters per text chunk | 800-2000 |
| overlap | Characters shared between adjacent chunks | 100-300 |
| batch-size | Number of PDF pages processed together | 5-20 |
| questions | Q&A pairs generated per chunk per question type | 1-5 |
| temperature | Higher = more creative, Lower = more focused | 0.3-0.9 |
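To make the chunk-size/overlap relationship concrete, here is a minimal character-based sketch of overlapping chunking (an illustration of the idea, not the script's actual implementation):

```python
def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 150) -> list[str]:
    """Split text into chunks of at most chunk_size characters; adjacent
    chunks repeat the trailing `overlap` characters of the previous chunk."""
    assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Advance by chunk_size - overlap so context carries across boundaries
        start += chunk_size - overlap
    return chunks

# With the CLI defaults, chunk 0 covers characters 0-1199 and chunk 1
# starts at 1050, repeating the previous chunk's last 150 characters.
```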
| Use Case | chunk-size | overlap | questions |
|---|---|---|---|
| Short documents | 800 | 100 | 2 |
| Standard | 1200 | 150 | 2 |
| Long/detailed texts | 2000 | 300 | 3 |
| Format | Description | Parser |
|---|---|---|
| `.pdf` | PDF documents | PyMuPDF |
| `.txt` | Plain text files | UTF-8 |
| `.json` | JSON files | Recursive extraction |
| `.jsonl` | JSON Lines | Line-by-line |
| `.csv` | CSV spreadsheets | Pandas |
JSONL (one object per line):
{"question": "What is the main theme of this text?", "answer": "The text discusses...", "question_type": "summary", "_source_file": "document.pdf", "_chunk_index": 0, "_chunk_excerpt": "First 200 characters..."}CSV columns:
question- Questionanswer- Answerquestion_type- Question type_source_file- Source filename_chunk_index- Chunk number_chunk_excerpt- Text excerpt
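Because each line is a standalone JSON object, the JSONL output can be loaded with nothing but the standard library; for example (the file name is just an example):

```python
import json

pairs = []
with open("dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        pairs.append((record["question"], record["answer"]))

print(f"Loaded {len(pairs)} Q&A pairs")
```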
```json
{
  "question": "What are the key characteristics of tragedy according to Aristotle?",
  "answer": "According to Aristotle, tragedy is an imitation of an action that is serious, complete, and of a certain magnitude. It uses language with artistic ornaments and involves incidents arousing pity and fear to accomplish catharsis.",
  "question_type": "summary",
  "_source_file": "aristoteles-poetika.pdf",
  "_chunk_index": 2,
  "_chunk_excerpt": "Tragedy is an imitation of an action that is serious..."
}
```

Question types:
- summary - Text summarization questions
- factual - Information extraction questions
- analysis - Deep analysis questions
- definition - Concept definition questions
- comparison - Comparison questions
```
InstructDatasetScript/
├── instructDataset.py    # Main CLI script
├── config.py             # Configuration management
├── providers.py          # LLM provider abstraction
├── parsers.py            # Document parsers
├── requirements.txt      # Python dependencies
├── .env                  # Your config (private, git-ignored)
├── .env.example          # Example config for users
├── .gitignore            # Git ignore patterns
└── README.md             # This file
```
Edit `providers.py`:

```python
class MyProvider(LLMProvider):
    def generate(self, prompt: str, max_tokens: int = 512, **kwargs) -> str:
        # Make the API call and return the generated text
        return response_text

# Register in PROVIDER_REGISTRY
PROVIDER_REGISTRY["myprovider"] = MyProvider
```
Edit `parsers.py`:

```python
class MyFormatParser(DocumentParser):
    def parse_text(self) -> str:
        # Parse the file and return its text content
        return extracted_text

# Register in PARSER_REGISTRY
PARSER_REGISTRY[".myext"] = MyFormatParser
```
The script includes automatic retry with exponential backoff for rate limits:

- Max retries: 8 (configurable)
- Base delay: 3 seconds
- Max delay: 60 seconds
- Smart detection: Extracts wait time from API error messages
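A simplified sketch of this retry pattern is shown below; the defaults mirror the values above, `RateLimitError` stands in for whatever exception your provider raises, and this is not the script's exact code:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the provider-specific rate-limit exception."""

def call_with_backoff(call, max_retries=8, base_delay=3.0, max_delay=60.0):
    """Retry `call` on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            # Exponential backoff capped at max_delay, plus jitter
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, 1))
    raise RuntimeError("Rate limit retries exhausted")
```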
If you encounter persistent rate limits, consider:
- Using a model with higher limits
- Reducing the `--questions` parameter
- Increasing `--batch-size` to reduce total API calls
MIT License
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing`)
- Open a Pull Request