
Blevene/standalone_tdo


TDO Standalone Extractor

A powerful, standalone command-line tool for extracting Cyber Threat Intelligence (CTI) from documents using Large Language Models with advanced structured output capabilities.

🚀 Features

  • Multi-format Document Support: PDF, DOCX, TXT, and Markdown files
  • Advanced LLM Integration: Google Gemini models with structured output and automatic fallback parsing
  • Comprehensive CTI Schema: 12 entity types and 24 relationship types with rich properties
  • Detection Opportunity Generation: Creates actionable threat detection rules with evidence backing
  • Attack Flow Synthesis: Generates evidence-based MITRE ATT&CK flow diagrams
  • Structured Output Reliability: Pydantic schemas with structured-output mode, backed by automatic fallback parsing for dependable JSON
  • Parallel Processing: Process multiple files concurrently with progress tracking
  • Portable & Self-contained: Minimal dependencies with environment-based configuration

📋 Quick Start

1. Installation

# Clone or download the standalone-tdo folder
cd standalone-tdo

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Configuration

The tool uses environment variables loaded from a .env file for configuration:

# Copy the example configuration
cp env.example .env

# Edit .env with your settings
# Required variables:
GEMINI_API_KEY=your-google-ai-api-key-here
GEMINI_MODEL=gemini-2.5-flash-preview-05-20

Getting a Gemini API Key:

  1. Visit Google AI Studio
  2. Create a new API key
  3. Copy the key to your .env file

Available Models:

  • gemini-2.5-flash-preview-05-20 (recommended - fast and cost-effective)
  • gemini-2.5-pro-preview-06-05 (more powerful, slower)
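Startup validation of this configuration can be sketched with the standard library alone. `load_env_file` below is a hypothetical, simplified stand-in for python-dotenv (which the tool actually uses), and `validate_env` mirrors the error message shown in the Troubleshooting section:

```python
import os

REQUIRED_VARS = ("GEMINI_API_KEY", "GEMINI_MODEL")

def load_env_file(path=".env"):
    """Minimal stand-in for python-dotenv: read KEY=VALUE lines into os.environ."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: real environment variables win over .env values
            os.environ.setdefault(key.strip(), value.strip())

def validate_env():
    """Fail fast when required configuration is absent."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            "Required environment variables missing: " + ", ".join(missing)
        )
```

This is a sketch of the idea, not the tool's actual loader; python-dotenv additionally handles quoting, export prefixes, and multi-line values.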

3. Basic Usage

# Process a single file with all features
python tdo_bulk.py report.pdf --flow --opps

# Process multiple files
python tdo_bulk.py file1.pdf file2.docx file3.txt

# Process all files in a directory with parallel workers
python tdo_bulk.py reports/ -j 4

# Generate comprehensive analysis with evaluation
python tdo_bulk.py report.pdf --flow --opps --eval-opps

🎯 Command Line Reference

usage: tdo_bulk.py [options] FILE [FILE ...]

Positional Arguments:
  FILE                  One or more files or directories to process

Core Options:
  -o, --output DIR      Output directory (default: ./extracted_data)
  --csv FILE            Export summary to CSV file
  -j, --jobs N          Number of parallel workers (default: 1)
  --retries N           LLM retry count (default: 2)
  --backoff SEC         Exponential back-off base (default: 1.5)

Analysis Features:
  --flow                Generate Attack-Flow JSON
  --opps                Generate Threat Detection Opportunities
  --eval-opps           Evaluate detection opportunities quality
  --debug-opps          Show debug information for opportunity generation

Output Control:
  -q, --quiet           Minimal console output
  -v, --verbose         Debug-level logging
  -h, --help            Show help message and exit
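The --retries/--backoff pair describes a standard retry loop around LLM calls. A minimal sketch, assuming the delay is the back-off base raised to the attempt number (one plausible reading of "exponential back-off base"; the tool's exact formula is not documented here), with `call_with_retries` as an illustrative helper rather than the tool's real function:

```python
import time

def call_with_retries(fn, retries=2, backoff_base=1.5, sleep=time.sleep):
    """Call fn(), retrying up to `retries` extra times on failure and
    sleeping backoff_base ** attempt seconds between attempts."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:  # the real tool would catch specific API errors
            last_error = exc
            if attempt < retries:
                sleep(backoff_base ** attempt)
    raise last_error
```

Injecting `sleep` keeps the helper testable; production code would leave the default `time.sleep`.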

Example Commands

# Basic CTI extraction
python tdo_bulk.py threat_report.pdf

# Full analysis with all features
python tdo_bulk.py apt_report.pdf --flow --opps --eval-opps

# Batch processing with parallel workers
python tdo_bulk.py reports_folder/ -j 8 --flow --opps

# Export results to CSV
python tdo_bulk.py *.pdf --csv results.csv

# Quiet mode for automation
python tdo_bulk.py reports/ -q -o /var/soc/cti --opps

📊 Output Files

For each processed file, the tool generates:

Primary Outputs

  • {filename}_extracted.json: Structured CTI data with comprehensive schema
  • {filename}_{timestamp}.md: Human-readable markdown report with all analysis

Optional Outputs (with flags)

  • Attack Flow JSON: Evidence-based MITRE ATT&CK flow structure (with --flow)
  • Detection Opportunities: Actionable detection rules with evidence (with --opps)

πŸ” CTI Schema Overview

Entity Types (12 Types)

| Entity Type | Description | Key Properties |
| --- | --- | --- |
| ThreatActor | Cyber threat groups | aliases, primary_motivation, first_seen |
| Tool | Legitimate software | family, capabilities, kill_chain_phases |
| Malware | Malicious software | family, capabilities, first_seen, last_seen |
| Technique | MITRE ATT&CK techniques | id (T1234), description, kill_chain_phases |
| Tactic | MITRE ATT&CK tactics | id (TA0001), description |
| Infrastructure | IPs, domains, URLs | type, tags, first_seen, last_seen |
| Indicator | File hashes, patterns | value, pattern, valid_from, valid_until |
| Vulnerability | CVE entries | id, cvss_score, affected_software |
| Campaign | Named attack campaigns | objective, status, first_seen |
| Identity | Target organizations | type, description |
| CourseOfAction | Mitigations, patches | type, description |
| Source | Document provenance | filename, document_title, file_size_mb |
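The Architecture section notes that Pydantic models enforce this schema. As a rough stdlib-only illustration, a few of the shapes above might look like the dataclasses below; the field names come from the schema table, while the class layout, types, and the `Relationship` edge shape are assumptions, not the tool's actual models:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ThreatActor:
    """Cyber threat group (key properties from the schema table)."""
    name: str
    aliases: List[str] = field(default_factory=list)
    primary_motivation: Optional[str] = None
    first_seen: Optional[str] = None  # e.g. an ISO-8601 date string

@dataclass
class Technique:
    """MITRE ATT&CK technique."""
    id: str                           # e.g. "T1234"
    description: str = ""
    kill_chain_phases: List[str] = field(default_factory=list)

@dataclass
class Relationship:
    """Edge between two entities; all relationship types share these properties."""
    source_ref: str
    target_ref: str
    type: str                         # e.g. "USES_TECHNIQUE"
    confidence: Optional[float] = None
    first_seen: Optional[str] = None
    last_seen: Optional[str] = None
```

Pydantic would add what dataclasses lack here: runtime type coercion and validation errors for malformed LLM output.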

Relationship Types (24 Types)

Attribution & Actor Relationships:

  • USES_TOOL, USES_TECHNIQUE, CONDUCTS, ATTRIBUTED_TO, TARGETS

Technical Relationships:

  • TOOL_IMPLEMENTS_TECHNIQUE, HOSTS_ON, COMMUNICATES_WITH, VARIANT_OF
  • DROPS, DOWNLOADS, INDICATES, OBSERVED_ON, EXPLOITS

Advanced Relationships:

  • MITIGATES, DETECTS, IS_SUBTECHNIQUE_OF, FLOW_CONTAINS_STEP
  • FLOW_USED_BY_ACTOR, OBSERVES, SOURCED_FROM

All relationships support properties like confidence, source, first_seen, last_seen.

💡 Detection Opportunities

The tool generates evidence-based detection opportunities with:

Core Features

  • Technique Mapping: Links to specific MITRE ATT&CK techniques
  • Observable Artefacts: Specific indicators to detect
  • Behavioral Patterns: Sequences and patterns to monitor
  • Evidence Citations: Direct quotes from source reports
  • Confidence Scores: Reliability assessment (0.0-1.0)
  • Quality Evaluation: Automated scoring with criteria breakdown

Example Output

{
  "id": "opp-001",
  "name": "Detect PowerShell Process Injection",
  "technique_id": "T1055",
  "artefacts": ["powershell.exe with WriteProcessMemory calls"],
  "behaviours": ["Process injection into legitimate processes"],
  "rationale": "APT groups commonly use PowerShell for process injection",
  "confidence": 0.8,
  "source": "PowerShell used for process injection (T1055)",
  "evidence": [
    "• PowerShell executes WriteProcessMemory calls (line 45)",
    "• Relationship: ThreatActor USES_TECHNIQUE T1055"
  ]
}
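A consumer of this output might sanity-check each opportunity before acting on it. A hedged stdlib sketch using the field names from the example above; `check_opportunity` and its specific checks are illustrative, not part of the tool:

```python
import json
import re

REQUIRED_KEYS = ("id", "name", "technique_id", "confidence", "evidence")

def check_opportunity(raw):
    """Parse one detection-opportunity JSON object and sanity-check key fields."""
    opp = json.loads(raw)
    for key in REQUIRED_KEYS:
        if key not in opp:
            raise ValueError("missing field: " + key)
    # MITRE technique IDs look like T1055 or, for sub-techniques, T1055.012
    if not re.fullmatch(r"T\d{4}(\.\d{3})?", opp["technique_id"]):
        raise ValueError("bad technique id: " + str(opp["technique_id"]))
    # Confidence is documented as a 0.0-1.0 reliability score
    if not 0.0 <= opp["confidence"] <= 1.0:
        raise ValueError("confidence out of range: " + str(opp["confidence"]))
    return opp
```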

🔗 Attack Flow Synthesis

Enhanced attack flows provide:

  • Evidence-based Ordering: Steps backed by temporal and causal evidence
  • Entity Referencing: Each step references extracted CTI entities
  • Explicit Reasoning: Justification for step ordering with citations
  • Comprehensive Coverage: Up to 25 steps covering full attack lifecycle

Example Attack Flow

{
  "flow": {
    "label": "AttackFlow",
    "pk": "attack-flow--uuid",
    "properties": {
      "name": "APT29 Multi-stage Attack",
      "description": "Sophisticated spear-phishing to data exfiltration flow"
    }
  },
  "steps": [
    {
      "order": 1,
      "entity": {"label": "Technique", "pk": "T1566.001"},
      "description": "Spear-phishing attachment delivery",
      "reason": "Initial access method cited in report section 2.1"
    }
  ]
}
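Because each step carries an explicit order field, downstream code can sort a flow and verify nothing was dropped before rendering it. An illustrative stdlib sketch (not the tool's actual code):

```python
def normalize_steps(steps):
    """Sort attack-flow steps by their 'order' field and verify the
    sequence is contiguous starting at 1 (a gap suggests a dropped step)."""
    ordered = sorted(steps, key=lambda s: s["order"])
    for expected, step in enumerate(ordered, start=1):
        if step["order"] != expected:
            raise ValueError(
                "step order gap: expected %d, got %d" % (expected, step["order"])
            )
    return ordered
```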

πŸ—οΈ Architecture Overview

Core Components

🎯 Main CLI (tdo_bulk.py)

  • Entry point with argument parsing and environment validation
  • Loads .env configuration using python-dotenv
  • Orchestrates bulk processing with progress tracking

βš™οΈ Bulk Runner (tdo_bulk_runner.py)

  • Manages parallel processing with ThreadPoolExecutor
  • Coordinates extraction pipeline stages
  • Handles progress callbacks and error recovery
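The ThreadPoolExecutor pattern described above can be sketched as follows; `process_all` and its callback signature are illustrative assumptions, not the runner's real API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_all(paths, process_one, jobs=1, on_progress=None):
    """Run process_one over every path with up to `jobs` worker threads,
    collecting per-file results and errors instead of aborting the batch."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # error recovery: record and continue
                errors[path] = exc
            if on_progress:
                on_progress(done, len(futures), path)
    return results, errors
```

Threads (rather than processes) fit this workload because each worker spends most of its time waiting on the Gemini API, not on the CPU.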

🧠 CTI Extractor (tdo_core/extractors/report_processor.py)

  • Document Parsing: Multi-format support (PDF, DOCX, TXT, MD)
  • LLM Integration: Google Gemini with structured output
  • Fallback Parsing: Manual JSON parsing when structured output fails
  • Token Management: Automatic chunking for large documents
  • Schema Validation: Pydantic models ensure data consistency

πŸ” TDO Generator (tdo_core/detection/opportunity_generator.py)

  • Evidence-based detection opportunity creation
  • MITRE ATT&CK technique mapping
  • Quality evaluation with scoring criteria
  • Debug mode for detailed analysis

🌊 Attack Flow Synthesizer (tdo_core/flows/attack_flow_synthesizer.py)

  • LLM-powered flow generation with evidence backing
  • Rule-based fallback for simple flows
  • Step ordering with explicit reasoning
  • Entity relationship analysis

πŸ“ Report Generator (tdo_core/report/md_export.py)

  • Human-readable markdown report creation
  • Structured data presentation
  • Integration of all analysis components

🎨 Prompt Management (tdo_core/llm/prompts.py)

  • Centralized prompt templates
  • Structured output optimization
  • Best practices for LLM interaction

Data Flow

Document Input β†’ Text Extraction β†’ LLM Processing β†’ Structured Output
     ↓              ↓                 ↓               ↓
   PDF/DOCX      Clean Text       Gemini API     JSON + Fallback
     ↓              ↓                 ↓               ↓
 Multi-format   Preprocessing    Pydantic Schema  Validation
     ↓              ↓                 ↓               ↓
File Support   Text Cleaning    Structured Data   CTI Graph
                                      ↓
                              Post-Processing
                                 ↓         ↓
                          Attack Flows  Detection Opps
                                 ↓         ↓
                            Markdown Report Export

πŸ› οΈ Advanced Configuration

Environment Variables

| Variable | Required | Description | Example |
| --- | --- | --- | --- |
| GEMINI_API_KEY | ✅ | Google AI API key | AIza... |
| GEMINI_MODEL | ✅ | Gemini model to use | gemini-2.5-flash-preview-05-20 |

Model Selection Guide

| Model | Speed | Quality | Cost | Use Case |
| --- | --- | --- | --- | --- |
| gemini-2.5-flash-preview-05-20 | ⚡ Fast | 🎯 Good | 💰 Low | Production, bulk processing |
| gemini-2.5-pro-preview-06-05 | 🐌 Slow | 🏆 Excellent | 💸 High | Complex analysis, research |

🔧 Troubleshooting

Common Issues

Missing Environment Variables

Error: Required environment variables missing: GEMINI_API_KEY, GEMINI_MODEL

Solution: Create .env file with both required variables

Unsupported File Format

Skipping unsupported file: document.rtf

Solution: Convert to PDF, DOCX, TXT, or MD format

LLM Processing Errors

Structured Gemini API error: ...

Solutions:

  • Check API key validity
  • Verify model name spelling
  • Try with smaller documents first
  • Use --verbose for detailed error logs

JSON Parsing Issues

The tool automatically handles JSON parsing issues with:

  • Structured output as primary method
  • Manual parsing fallback
  • JSON repair for truncated responses
  • Graceful error recovery
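The "JSON repair for truncated responses" step can be approximated by re-balancing brackets. A simplified sketch of the idea (the tool's actual repair logic may differ and likely handles more cases):

```python
import json

def repair_truncated_json(text):
    """Append the closing brackets a truncated JSON payload is missing.
    Simplified: handles unbalanced {...} and [...], but not a string
    that was cut off mid-value."""
    stack = []
    in_string = False
    escape = False
    for ch in text:
        if escape:
            escape = False
        elif ch == "\\" and in_string:
            escape = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "{[":
                stack.append("}" if ch == "{" else "]")
            elif ch in "}]" and stack:
                stack.pop()
    repaired = text.rstrip()
    if repaired.endswith(","):  # drop a dangling comma before closing
        repaired = repaired[:-1]
    return repaired + "".join(reversed(stack))
```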

Performance Optimization

  • Parallel Processing: Use -j flag for multiple files
  • Model Selection: Use faster models for bulk processing
  • Quiet Mode: Use -q for automation scripts
  • Output Management: Organize with custom output directories

Debug Mode

# Enable verbose logging
python tdo_bulk.py report.pdf -v

# Debug detection opportunities
python tdo_bulk.py report.pdf --opps --debug-opps

# Export detailed logs
python tdo_bulk.py report.pdf -v > processing.log 2>&1

📦 Dependencies

The tool requires Python 3.9+ and these key packages:

  • google-generativeai: Google AI (Gemini) API client with structured output
  • PyMuPDF: High-performance PDF text extraction
  • python-docx: Microsoft Word document processing
  • pandas: Data manipulation and CSV export
  • pydantic: Schema validation and structured output
  • python-dotenv: Environment variable management
  • cleantext: Text preprocessing utilities
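Together these packages support the multi-format dispatch described in the Architecture section. A minimal sketch: the TXT/MD branches use only the standard library, while the PDF and DOCX branches lazily import PyMuPDF (imported as fitz) and python-docx; `extract_text` is an illustrative helper, not the tool's actual function:

```python
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".txt", ".md"}

def extract_text(path):
    """Dispatch on file extension and return the document's plain text."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError("Skipping unsupported file: " + str(path))
    if suffix in {".txt", ".md"}:
        return Path(path).read_text(encoding="utf-8", errors="replace")
    if suffix == ".pdf":
        import fitz  # PyMuPDF
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    import docx  # python-docx
    return "\n".join(p.text for p in docx.Document(path).paragraphs)
```

Deferring the third-party imports into the branches keeps the TXT/MD path usable even where PyMuPDF or python-docx is not installed.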

🚧 Limitations

This standalone version:

  • Does not include Knowledge Graph or Neo4j integration
  • Does not support vector-database features for similarity search
  • Does not provide real-time retrieval augmentation
  • Is simplified compared to the full TDO platform

However, it includes all core CTI extraction and analysis capabilities needed for most use cases.

🔄 Recent Enhancements

  • Environment Configuration: Migrated to .env file management with dotenv
  • Structured Output: Reliable JSON parsing via Google Gemini structured output, with automatic fallback
  • Enhanced Schemas: Comprehensive 12 entity types and 24 relationship types
  • Detection Opportunities: Evidence-based detection rule generation with evaluation
  • Attack Flow Improvements: Enhanced with evidence backing and explicit reasoning
  • No Hardcoded Defaults: All configuration comes from environment variables
  • Robust Error Handling: Automatic fallback parsing and graceful error recovery

🔒 Security Considerations

API Key Protection

  • Never commit your .env file - it's excluded by .gitignore
  • Store GEMINI_API_KEY securely using environment variables
  • Rotate API keys periodically
  • Review SECURITY.md for detailed security guidance

Data Privacy

  • Document text is sent to Google's Gemini API for processing
  • Review Google's AI data usage policies before processing sensitive documents
  • Extracted data may contain sensitive information (IOCs, organization names, file paths)
  • Review outputs before sharing outside your organization

Output Data

Extracted JSON and Markdown files may contain:

  • IP addresses and domain names (infrastructure indicators)
  • File hashes and technical indicators
  • Organization and target information
  • Detailed threat actor and campaign data

Always review outputs before public sharing.

Reporting Security Issues

Please report security vulnerabilities responsibly. See SECURITY.md for details.

📄 License

This project is licensed under the MIT License - see LICENSE for details.

🤝 Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.


Need Help? Run python tdo_bulk.py --help for quick reference or check the troubleshooting section above.
