A powerful, standalone command-line tool for extracting Cyber Threat Intelligence (CTI) from documents using Large Language Models with advanced structured output capabilities.
- Multi-format Document Support: PDF, DOCX, TXT, and Markdown files
- Advanced LLM Integration: Google Gemini models with structured output and automatic fallback parsing
- Comprehensive CTI Schema: 12 entity types and 24 relationship types with rich properties
- Detection Opportunity Generation: Creates actionable threat detection rules with evidence backing
- Attack Flow Synthesis: Generates evidence-based MITRE ATT&CK flow diagrams
- Structured Output Reliability: Pydantic schemas with 100% JSON parsing success rate
- Parallel Processing: Process multiple files concurrently with progress tracking
- Portable & Self-contained: Minimal dependencies with environment-based configuration
```bash
# Clone or download the standalone-tdo folder
cd standalone-tdo

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

The tool uses environment variables loaded from a `.env` file for configuration:
```bash
# Copy the example configuration
cp env.example .env

# Edit .env with your settings
# Required variables:
GEMINI_API_KEY=your-google-ai-api-key-here
GEMINI_MODEL=gemini-2.5-flash
```

Getting a Gemini API Key:
- Visit Google AI Studio
- Create a new API key
- Copy the key into your `.env` file
Available Models:
- `gemini-2.5-flash-preview-05-20` (recommended: fast and cost-effective)
- `gemini-2.5-pro-preview-06-05` (more powerful, slower)
```bash
# Process a single file with all features
python tdo_bulk.py report.pdf --flow --opps
# Process multiple files
python tdo_bulk.py file1.pdf file2.docx file3.txt
# Process all files in a directory with parallel workers
python tdo_bulk.py reports/ -j 4
# Generate comprehensive analysis with evaluation
python tdo_bulk.py report.pdf --flow --opps --eval-opps
```

```
usage: tdo_bulk.py [options] FILE [FILE ...]

Positional Arguments:
  FILE                One or more files or directories to process

Core Options:
  -o, --output DIR    Output directory (default: ./extracted_data)
  --csv FILE          Export summary to CSV file
  -j, --jobs N        Number of parallel workers (default: 1)
  --retries N         LLM retry count (default: 2)
  --backoff SEC       Exponential back-off base (default: 1.5)

Analysis Features:
  --flow              Generate Attack-Flow JSON
  --opps              Generate Threat Detection Opportunities
  --eval-opps         Evaluate detection opportunity quality
  --debug-opps        Show debug information for opportunity generation

Output Control:
  -q, --quiet         Minimal console output
  -v, --verbose       Debug-level logging
  -h, --help          Show help message and exit
```
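The exact retry schedule is internal to the tool, but as a rough mental model of how `--retries` and `--backoff` interact (assuming a simple `backoff ** attempt` delay, which is an assumption rather than documented behaviour), a retry loop might look like this:

```python
import time

def call_with_retries(call, retries: int = 2, backoff: float = 1.5):
    """Illustrative only: retry a flaky LLM call, sleeping backoff**attempt between tries."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries, surface the error
            time.sleep(backoff ** attempt)  # waits of 1.0s, 1.5s, 2.25s, ... for base 1.5
```

With the defaults (`--retries 2`, `--backoff 1.5`) that means at most three attempts per LLM call.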
```bash
# Basic CTI extraction
python tdo_bulk.py threat_report.pdf

# Full analysis with all features
python tdo_bulk.py apt_report.pdf --flow --opps --eval-opps

# Batch processing with parallel workers
python tdo_bulk.py reports_folder/ -j 8 --flow --opps

# Export results to CSV
python tdo_bulk.py *.pdf --csv results.csv

# Quiet mode for automation
python tdo_bulk.py reports/ -q -o /var/soc/cti --opps
```

For each processed file, the tool generates:

- `{filename}_extracted.json`: Structured CTI data following the comprehensive schema
- `{filename}_{timestamp}.md`: Human-readable Markdown report with all analysis
- Attack Flow JSON: Evidence-based MITRE ATT&CK flow structure (with `--flow`)
- Detection Opportunities: Actionable detection rules with evidence (with `--opps`)
| Entity Type | Description | Key Properties |
|---|---|---|
| ThreatActor | Cyber threat groups | aliases, primary_motivation, first_seen |
| Tool | Legitimate software | family, capabilities, kill_chain_phases |
| Malware | Malicious software | family, capabilities, first_seen, last_seen |
| Technique | MITRE ATT&CK techniques | id (T1234), description, kill_chain_phases |
| Tactic | MITRE ATT&CK tactics | id (TA0001), description |
| Infrastructure | IPs, domains, URLs | type, tags, first_seen, last_seen |
| Indicator | File hashes, patterns | value, pattern, valid_from, valid_until |
| Vulnerability | CVE entries | id, cvss_score, affected_software |
| Campaign | Named attack campaigns | objective, status, first_seen |
| Identity | Target organizations | type, description |
| CourseOfAction | Mitigations, patches | type, description |
| Source | Document provenance | filename, document_title, file_size_mb |
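The authoritative Pydantic definitions live inside the tool; as an illustration only (field names taken from the table above, everything else assumed), a `ThreatActor` entity could be modelled roughly like this:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ThreatActor(BaseModel):
    """Illustrative sketch of one entity type; the shipped schema may differ."""
    name: str
    aliases: list[str] = Field(default_factory=list)
    primary_motivation: Optional[str] = None
    first_seen: Optional[str] = None  # e.g. an ISO 8601 date string

# Pydantic validates and normalises LLM output before it reaches the JSON export
actor = ThreatActor(name="APT29", aliases=["Cozy Bear"])
```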
Attribution & Actor Relationships:
`USES_TOOL`, `USES_TECHNIQUE`, `CONDUCTS`, `ATTRIBUTED_TO`, `TARGETS`

Technical Relationships:
`TOOL_IMPLEMENTS_TECHNIQUE`, `HOSTS_ON`, `COMMUNICATES_WITH`, `VARIANT_OF`, `DROPS`, `DOWNLOADS`, `INDICATES`, `OBSERVED_ON`, `EXPLOITS`

Advanced Relationships:
`MITIGATES`, `DETECTS`, `IS_SUBTECHNIQUE_OF`, `FLOW_CONTAINS_STEP`, `FLOW_USED_BY_ACTOR`, `OBSERVES`, `SOURCED_FROM`

All relationships support properties like `confidence`, `source`, `first_seen`, and `last_seen`.
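In the extracted JSON, each relationship is an edge of one of the types above plus an optional property bag, shaped roughly like this (illustrative values and key names, not taken from a real report):

```python
# Illustrative relationship record; the exact key names in the export may differ
relationship = {
    "type": "USES_TECHNIQUE",
    "source_entity": "APT29",   # ThreatActor
    "target_entity": "T1055",   # Technique
    "properties": {
        "confidence": 0.8,
        "source": "threat_report.pdf",
        "first_seen": "2024-01-15",
    },
}
```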
The tool generates evidence-based detection opportunities with:
- Technique Mapping: Links to specific MITRE ATT&CK techniques
- Observable Artefacts: Specific indicators to detect
- Behavioral Patterns: Sequences and patterns to monitor
- Evidence Citations: Direct quotes from source reports
- Confidence Scores: Reliability assessment (0.0-1.0)
- Quality Evaluation: Automated scoring with criteria breakdown
```json
{
  "id": "opp-001",
  "name": "Detect PowerShell Process Injection",
  "technique_id": "T1055",
  "artefacts": ["powershell.exe with WriteProcessMemory calls"],
  "behaviours": ["Process injection into legitimate processes"],
  "rationale": "APT groups commonly use PowerShell for process injection",
  "confidence": 0.8,
  "source": "PowerShell used for process injection (T1055)",
  "evidence": [
    "• PowerShell executes WriteProcessMemory calls (line 45)",
    "• Relationship: ThreatActor USES_TECHNIQUE T1055"
  ]
}
```

Enhanced attack flows provide:
- Evidence-based Ordering: Steps backed by temporal and causal evidence
- Entity Referencing: Each step references extracted CTI entities
- Explicit Reasoning: Justification for step ordering with citations
- Comprehensive Coverage: Up to 25 steps covering full attack lifecycle
```json
{
  "flow": {
    "label": "AttackFlow",
    "pk": "attack-flow--uuid",
    "properties": {
      "name": "APT29 Multi-stage Attack",
      "description": "Sophisticated spear-phishing to data exfiltration flow"
    }
  },
  "steps": [
    {
      "order": 1,
      "entity": {"label": "Technique", "pk": "T1566.001"},
      "description": "Spear-phishing attachment delivery",
      "reason": "Initial access method cited in report section 2.1"
    }
  ]
}
```

- Entry point with argument parsing and environment validation
- Loads `.env` configuration using `python-dotenv`
- Orchestrates bulk processing with progress tracking
- Manages parallel processing with `ThreadPoolExecutor` (see the sketch below)
- Coordinates extraction pipeline stages
- Handles progress callbacks and error recovery
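The parallel-processing behaviour enabled by `-j` follows the standard `ThreadPoolExecutor` pattern; the sketch below is illustrative only (function names are placeholders, not the tool's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def process_file(path: Path) -> dict:
    """Placeholder for the real pipeline: parse -> LLM extraction -> validation -> export."""
    raise NotImplementedError

def run_bulk(files: list, jobs: int = 1) -> list:
    results = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(process_file, f): f for f in files}
        for future in as_completed(futures):
            src = futures[future]
            try:
                results.append(future.result())
                print(f"done: {src.name}")
            except Exception as exc:  # one bad file should not abort the whole batch
                print(f"failed: {src.name}: {exc}")
    return results
```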
- Document Parsing: Multi-format support (PDF, DOCX, TXT, MD)
- LLM Integration: Google Gemini with structured output (see the sketch after this list)
- Fallback Parsing: Manual JSON parsing when structured output fails
- Token Management: Automatic chunking for large documents
- Schema Validation: Pydantic models ensure data consistency
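Putting the LLM-integration and schema-validation steps from the list above together, a minimal sketch of the structured-output call (assuming the `google-generativeai` client; the prompt, schema, and error handling are simplified placeholders rather than the module's real code) looks roughly like this:

```python
import os
import google.generativeai as genai
from pydantic import BaseModel, ValidationError

class ExtractionResult(BaseModel):
    """Simplified stand-in for the tool's full CTI schema."""
    entities: list = []
    relationships: list = []

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel(os.environ["GEMINI_MODEL"])

def extract(document_text: str) -> ExtractionResult:
    response = model.generate_content(
        f"Extract CTI entities and relationships as JSON:\n\n{document_text}",
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",  # ask Gemini for JSON-only output
        ),
    )
    try:
        return ExtractionResult.model_validate_json(response.text)
    except ValidationError:
        # In the real tool this is where the manual fallback parser takes over
        raise
```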
- Evidence-based detection opportunity creation
- MITRE ATT&CK technique mapping
- Quality evaluation with scoring criteria
- Debug mode for detailed analysis
- LLM-powered flow generation with evidence backing
- Rule-based fallback for simple flows
- Step ordering with explicit reasoning
- Entity relationship analysis
- Human-readable markdown report creation
- Structured data presentation
- Integration of all analysis components
- Centralized prompt templates
- Structured output optimization
- Best practices for LLM interaction
```
Document Input → Text Extraction → LLM Processing → Structured Output
      ↓                ↓                 ↓                  ↓
  PDF/DOCX        Clean Text         Gemini API       JSON + Fallback
      ↓                ↓                 ↓                  ↓
Multi-format     Preprocessing     Pydantic Schema      Validation
      ↓                ↓                 ↓                  ↓
File Support     Text Cleaning     Structured Data       CTI Graph
                                                            ↓
                                                     Post-Processing
                                                      ↓           ↓
                                               Attack Flows   Detection Opps
                                                      ↓           ↓
                                                 Markdown Report Export
```
| Variable | Required | Description | Example |
|---|---|---|---|
| `GEMINI_API_KEY` | Yes | Google AI API key | `AIza...` |
| `GEMINI_MODEL` | Yes | Gemini model to use | `gemini-2.5-flash-preview-05-20` |
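Configuration loading amounts to reading the `.env` file and refusing to start if either variable is missing; a minimal equivalent (illustrative, not the tool's exact code) would be:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

missing = [k for k in ("GEMINI_API_KEY", "GEMINI_MODEL") if not os.getenv(k)]
if missing:
    raise SystemExit(f"Required environment variables missing: {', '.join(missing)}")
```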
| Model | Speed | Quality | Cost | Use Case |
|---|---|---|---|---|
| `gemini-2.5-flash-preview-05-20` | Fast | Good | Low | Production, bulk processing |
| `gemini-2.5-pro-preview-06-05` | Slow | Excellent | High | Complex analysis, research |
Error: `Required environment variables missing: GEMINI_API_KEY, GEMINI_MODEL`
Solution: Create a `.env` file containing both required variables.

Error: `Skipping unsupported file: document.rtf`
Solution: Convert the file to PDF, DOCX, TXT, or MD format.

Error: `Structured Gemini API error: ...`
Solutions:
- Check API key validity
- Verify the model name spelling
- Try smaller documents first
- Use `--verbose` for detailed error logs
The tool automatically handles JSON parsing issues with:
- Structured output as primary method
- Manual parsing fallback
- JSON repair for truncated responses
- Graceful error recovery
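The layered recovery described in the list above can be pictured as a best-effort parser that only gives up after trying progressively more aggressive repairs; a simplified sketch (not the tool's actual implementation) follows:

```python
import json
import re
from typing import Optional

def parse_llm_json(raw: str) -> Optional[dict]:
    """Best-effort recovery of a JSON object from an LLM response."""
    # 1. Structured output usually yields clean JSON, so try it verbatim first.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. Strip surrounding prose or markdown fences and retry on the outermost object.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    candidate = match.group(0) if match else raw.strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    # 3. Crude repair for truncated responses: close any brackets left open.
    #    (Ignores brackets inside strings; good enough for a last-ditch attempt.)
    stack = []
    for ch in candidate:
        if ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    try:
        return json.loads(candidate.rstrip().rstrip(",") + "".join(reversed(stack)))
    except json.JSONDecodeError:
        return None  # caller degrades gracefully
```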
- Parallel Processing: Use the `-j` flag for multiple files
- Model Selection: Use faster models for bulk processing
- Quiet Mode: Use `-q` for automation scripts
- Output Management: Organize results with custom output directories
```bash
# Enable verbose logging
python tdo_bulk.py report.pdf -v

# Debug detection opportunities
python tdo_bulk.py report.pdf --opps --debug-opps

# Export detailed logs
python tdo_bulk.py report.pdf -v > processing.log 2>&1
```

The tool requires Python 3.9+ and these key packages:
- `google-generativeai`: Google AI (Gemini) API client with structured output
- `PyMuPDF`: High-performance PDF text extraction
- `python-docx`: Microsoft Word document processing
- `pandas`: Data manipulation and CSV export
- `pydantic`: Schema validation and structured output
- `python-dotenv`: Environment variable management
- `cleantext`: Text preprocessing utilities
This standalone version:
- Does not include Knowledge Graph or Neo4j integration
- Does not support vector database features for similarity search
- Does not provide real-time retrieval augmentation capabilities
- Simplified compared to the full TDO platform
However, it includes all core CTI extraction and analysis capabilities needed for most use cases.
- Environment Configuration: Migrated to `.env` file management with dotenv
- Structured Output: 100% reliable JSON parsing with Google Gemini structured output
- Enhanced Schemas: Comprehensive 12 entity types and 24 relationship types
- Detection Opportunities: Evidence-based detection rule generation with evaluation
- Attack Flow Improvements: Enhanced with evidence backing and explicit reasoning
- No Hardcoded Defaults: All configuration comes from environment variables
- Robust Error Handling: Automatic fallback parsing and graceful error recovery
- Never commit your `.env` file; it is excluded by `.gitignore`
- Store `GEMINI_API_KEY` securely using environment variables
- Rotate API keys periodically
- Review SECURITY.md for detailed security guidance
- Document text is sent to Google's Gemini API for processing
- Review Google's AI data usage policies before processing sensitive documents
- Extracted data may contain sensitive information (IOCs, organization names, file paths)
- Review outputs before sharing outside your organization
Extracted JSON and Markdown files may contain:
- IP addresses and domain names (infrastructure indicators)
- File hashes and technical indicators
- Organization and target information
- Detailed threat actor and campaign data
Always review outputs before public sharing.
Please report security vulnerabilities responsibly. See SECURITY.md for details.
This project is licensed under the MIT License - see LICENSE for details.
Contributions are welcome! Please read CONTRIBUTING.md for guidelines.
Need help? Run `python tdo_bulk.py --help` for a quick reference, or check the troubleshooting section above.