
Blevene/standalone_tdo


TDO Standalone Extractor

A powerful, standalone command-line tool for extracting Cyber Threat Intelligence (CTI) from documents using Large Language Models with advanced structured output capabilities.

🚀 Features

  • Multi-format Document Support: PDF, DOCX, TXT, and Markdown files
  • Advanced LLM Integration: Google Gemini models with structured output and automatic fallback parsing
  • Comprehensive CTI Schema: 12 entity types and 24 relationship types with rich properties
  • Detection Opportunity Generation: Creates actionable threat detection rules with evidence backing
  • Attack Flow Synthesis: Generates evidence-based MITRE ATT&CK flow diagrams
  • Structured Output Reliability: Pydantic schemas with structured-output mode, backed by automatic fallback parsing for dependable JSON
  • Parallel Processing: Process multiple files concurrently with progress tracking
  • Portable & Self-contained: Minimal dependencies with environment-based configuration

📋 Quick Start

1. Installation

# Clone or download the standalone-tdo folder
cd standalone-tdo

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Configuration

The tool uses environment variables loaded from a .env file for configuration:

# Copy the example configuration
cp env.example .env

# Edit .env with your settings
# Required variables:
GEMINI_API_KEY=your-google-ai-api-key-here
GEMINI_MODEL=gemini-2.5-flash-preview-05-20

Getting a Gemini API Key:

  1. Visit Google AI Studio
  2. Create a new API key
  3. Copy the key to your .env file

Available Models:

  • gemini-2.5-flash-preview-05-20 (recommended - fast and cost-effective)
  • gemini-2.5-pro-preview-06-05 (more powerful, slower)
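Startup validation of this configuration can be sketched with the standard library alone. `load_env_file` below is a hypothetical, simplified stand-in for python-dotenv (which the tool actually uses), and `validate_env` mirrors the error message shown in the Troubleshooting section:

```python
import os

REQUIRED_VARS = ("GEMINI_API_KEY", "GEMINI_MODEL")

def load_env_file(path=".env"):
    """Minimal stand-in for python-dotenv: read KEY=VALUE lines into os.environ."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: real environment variables win over .env values
            os.environ.setdefault(key.strip(), value.strip())

def validate_env():
    """Fail fast when required configuration is absent."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            "Required environment variables missing: " + ", ".join(missing)
        )
```

This is a sketch of the idea, not the tool's actual loader; python-dotenv additionally handles quoting, export prefixes, and multi-line values.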

3. Basic Usage

# Process a single file with all features
python tdo_bulk.py report.pdf --flow --opps

# Process multiple files
python tdo_bulk.py file1.pdf file2.docx file3.txt

# Process all files in a directory with parallel workers
python tdo_bulk.py reports/ -j 4

# Generate comprehensive analysis with evaluation
python tdo_bulk.py report.pdf --flow --opps --eval-opps

🎯 Command Line Reference

usage: tdo_bulk.py [options] FILE [FILE ...]

Positional Arguments:
  FILE                  One or more files or directories to process

Core Options:
  -o, --output DIR      Output directory (default: ./extracted_data)
  --csv FILE            Export summary to CSV file
  -j, --jobs N          Number of parallel workers (default: 1)
  --retries N           LLM retry count (default: 2)
  --backoff SEC         Exponential back-off base (default: 1.5)

Analysis Features:
  --flow                Generate Attack-Flow JSON
  --opps                Generate Threat Detection Opportunities
  --eval-opps           Evaluate detection opportunities quality
  --debug-opps          Show debug information for opportunity generation

Output Control:
  -q, --quiet           Minimal console output
  -v, --verbose         Debug-level logging
  -h, --help            Show help message and exit
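The --retries/--backoff pair describes a standard retry loop around LLM calls. A minimal sketch, assuming the delay is the back-off base raised to the attempt number (one plausible reading of "exponential back-off base"; the tool's exact formula is not documented here), with `call_with_retries` as an illustrative helper rather than the tool's real function:

```python
import time

def call_with_retries(fn, retries=2, backoff_base=1.5, sleep=time.sleep):
    """Call fn(), retrying up to `retries` extra times on failure and
    sleeping backoff_base ** attempt seconds between attempts."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:  # the real tool would catch specific API errors
            last_error = exc
            if attempt < retries:
                sleep(backoff_base ** attempt)
    raise last_error
```

Injecting `sleep` keeps the helper testable; production code would leave the default `time.sleep`.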

Example Commands

# Basic CTI extraction
python tdo_bulk.py threat_report.pdf

# Full analysis with all features
python tdo_bulk.py apt_report.pdf --flow --opps --eval-opps

# Batch processing with parallel workers
python tdo_bulk.py reports_folder/ -j 8 --flow --opps

# Export results to CSV
python tdo_bulk.py *.pdf --csv results.csv

# Quiet mode for automation
python tdo_bulk.py reports/ -q -o /var/soc/cti --opps

📊 Output Files

For each processed file, the tool generates:

Primary Outputs

  • {filename}_extracted.json: Structured CTI data with comprehensive schema
  • {filename}_{timestamp}.md: Human-readable markdown report with all analysis

Optional Outputs (with flags)

  • Attack Flow JSON: Evidence-based MITRE ATT&CK flow structure (with --flow)
  • Detection Opportunities: Actionable detection rules with evidence (with --opps)

πŸ” CTI Schema Overview

Entity Types (12 Types)

| Entity Type | Description | Key Properties |
| --- | --- | --- |
| ThreatActor | Cyber threat groups | aliases, primary_motivation, first_seen |
| Tool | Legitimate software | family, capabilities, kill_chain_phases |
| Malware | Malicious software | family, capabilities, first_seen, last_seen |
| Technique | MITRE ATT&CK techniques | id (T1234), description, kill_chain_phases |
| Tactic | MITRE ATT&CK tactics | id (TA0001), description |
| Infrastructure | IPs, domains, URLs | type, tags, first_seen, last_seen |
| Indicator | File hashes, patterns | value, pattern, valid_from, valid_until |
| Vulnerability | CVE entries | id, cvss_score, affected_software |
| Campaign | Named attack campaigns | objective, status, first_seen |
| Identity | Target organizations | type, description |
| CourseOfAction | Mitigations, patches | type, description |
| Source | Document provenance | filename, document_title, file_size_mb |
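The Architecture section notes that Pydantic models enforce this schema. As a rough stdlib-only illustration, a few of the shapes above might look like the dataclasses below; the field names come from the schema table, while the class layout, types, and the `Relationship` edge shape are assumptions, not the tool's actual models:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ThreatActor:
    """Cyber threat group (key properties from the schema table)."""
    name: str
    aliases: List[str] = field(default_factory=list)
    primary_motivation: Optional[str] = None
    first_seen: Optional[str] = None  # e.g. an ISO-8601 date string

@dataclass
class Technique:
    """MITRE ATT&CK technique."""
    id: str                           # e.g. "T1234"
    description: str = ""
    kill_chain_phases: List[str] = field(default_factory=list)

@dataclass
class Relationship:
    """Edge between two entities; all relationship types share these properties."""
    source_ref: str
    target_ref: str
    type: str                         # e.g. "USES_TECHNIQUE"
    confidence: Optional[float] = None
    first_seen: Optional[str] = None
    last_seen: Optional[str] = None
```

Pydantic would add what dataclasses lack here: runtime type coercion and validation errors for malformed LLM output.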

Relationship Types (24 Types)

Attribution & Actor Relationships:

  • USES_TOOL, USES_TECHNIQUE, CONDUCTS, ATTRIBUTED_TO, TARGETS

Technical Relationships:

  • TOOL_IMPLEMENTS_TECHNIQUE, HOSTS_ON, COMMUNICATES_WITH, VARIANT_OF
  • DROPS, DOWNLOADS, INDICATES, OBSERVED_ON, EXPLOITS

Advanced Relationships:

  • MITIGATES, DETECTS, IS_SUBTECHNIQUE_OF, FLOW_CONTAINS_STEP
  • FLOW_USED_BY_ACTOR, OBSERVES, SOURCED_FROM

All relationships support properties like confidence, source, first_seen, last_seen.

💡 Detection Opportunities

The tool generates evidence-based detection opportunities with:

Core Features

  • Technique Mapping: Links to specific MITRE ATT&CK techniques
  • Observable Artefacts: Specific indicators to detect
  • Behavioral Patterns: Sequences and patterns to monitor
  • Evidence Citations: Direct quotes from source reports
  • Confidence Scores: Reliability assessment (0.0-1.0)
  • Quality Evaluation: Automated scoring with criteria breakdown

Example Output

{
  "id": "opp-001",
  "name": "Detect PowerShell Process Injection",
  "technique_id": "T1055",
  "artefacts": ["powershell.exe with WriteProcessMemory calls"],
  "behaviours": ["Process injection into legitimate processes"],
  "rationale": "APT groups commonly use PowerShell for process injection",
  "confidence": 0.8,
  "source": "PowerShell used for process injection (T1055)",
  "evidence": [
    "• PowerShell executes WriteProcessMemory calls (line 45)",
    "• Relationship: ThreatActor USES_TECHNIQUE T1055"
  ]
}
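A consumer of this output might sanity-check each opportunity before acting on it. A hedged stdlib sketch using the field names from the example above; `check_opportunity` and its specific checks are illustrative, not part of the tool:

```python
import json
import re

REQUIRED_KEYS = ("id", "name", "technique_id", "confidence", "evidence")

def check_opportunity(raw):
    """Parse one detection-opportunity JSON object and sanity-check key fields."""
    opp = json.loads(raw)
    for key in REQUIRED_KEYS:
        if key not in opp:
            raise ValueError("missing field: " + key)
    # MITRE technique IDs look like T1055 or, for sub-techniques, T1055.012
    if not re.fullmatch(r"T\d{4}(\.\d{3})?", opp["technique_id"]):
        raise ValueError("bad technique id: " + str(opp["technique_id"]))
    # Confidence is documented as a 0.0-1.0 reliability score
    if not 0.0 <= opp["confidence"] <= 1.0:
        raise ValueError("confidence out of range: " + str(opp["confidence"]))
    return opp
```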

🔗 Attack Flow Synthesis

Enhanced attack flows provide:

  • Evidence-based Ordering: Steps backed by temporal and causal evidence
  • Entity Referencing: Each step references extracted CTI entities
  • Explicit Reasoning: Justification for step ordering with citations
  • Comprehensive Coverage: Up to 25 steps covering full attack lifecycle

Example Attack Flow

{
  "flow": {
    "label": "AttackFlow",
    "pk": "attack-flow--uuid",
    "properties": {
      "name": "APT29 Multi-stage Attack",
      "description": "Sophisticated spear-phishing to data exfiltration flow"
    }
  },
  "steps": [
    {
      "order": 1,
      "entity": {"label": "Technique", "pk": "T1566.001"},
      "description": "Spear-phishing attachment delivery",
      "reason": "Initial access method cited in report section 2.1"
    }
  ]
}
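Because each step carries an explicit order field, downstream code can sort a flow and verify nothing was dropped before rendering it. An illustrative stdlib sketch (not the tool's actual code):

```python
def normalize_steps(steps):
    """Sort attack-flow steps by their 'order' field and verify the
    sequence is contiguous starting at 1 (a gap suggests a dropped step)."""
    ordered = sorted(steps, key=lambda s: s["order"])
    for expected, step in enumerate(ordered, start=1):
        if step["order"] != expected:
            raise ValueError(
                "step order gap: expected %d, got %d" % (expected, step["order"])
            )
    return ordered
```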

πŸ—οΈ Architecture Overview

Core Components

🎯 Main CLI (tdo_bulk.py)

  • Entry point with argument parsing and environment validation
  • Loads .env configuration using python-dotenv
  • Orchestrates bulk processing with progress tracking

βš™οΈ Bulk Runner (tdo_bulk_runner.py)

  • Manages parallel processing with ThreadPoolExecutor
  • Coordinates extraction pipeline stages
  • Handles progress callbacks and error recovery
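The ThreadPoolExecutor pattern described above can be sketched as follows; `process_all` and its callback signature are illustrative assumptions, not the runner's real API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_all(paths, process_one, jobs=1, on_progress=None):
    """Run process_one over every path with up to `jobs` worker threads,
    collecting per-file results and errors instead of aborting the batch."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # error recovery: record and continue
                errors[path] = exc
            if on_progress:
                on_progress(done, len(futures), path)
    return results, errors
```

Threads (rather than processes) fit this workload because each worker spends most of its time waiting on the Gemini API, not on the CPU.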

🧠 CTI Extractor (tdo_core/extractors/report_processor.py)

  • Document Parsing: Multi-format support (PDF, DOCX, TXT, MD)
  • LLM Integration: Google Gemini with structured output
  • Fallback Parsing: Manual JSON parsing when structured output fails
  • Token Management: Automatic chunking for large documents
  • Schema Validation: Pydantic models ensure data consistency

πŸ” TDO Generator (tdo_core/detection/opportunity_generator.py)

  • Evidence-based detection opportunity creation
  • MITRE ATT&CK technique mapping
  • Quality evaluation with scoring criteria
  • Debug mode for detailed analysis

🌊 Attack Flow Synthesizer (tdo_core/flows/attack_flow_synthesizer.py)

  • LLM-powered flow generation with evidence backing
  • Rule-based fallback for simple flows
  • Step ordering with explicit reasoning
  • Entity relationship analysis

πŸ“ Report Generator (tdo_core/report/md_export.py)

  • Human-readable markdown report creation
  • Structured data presentation
  • Integration of all analysis components

🎨 Prompt Management (tdo_core/llm/prompts.py)

  • Centralized prompt templates
  • Structured output optimization
  • Best practices for LLM interaction

Data Flow

Document Input β†’ Text Extraction β†’ LLM Processing β†’ Structured Output
     ↓              ↓                 ↓               ↓
   PDF/DOCX      Clean Text       Gemini API     JSON + Fallback
     ↓              ↓                 ↓               ↓
 Multi-format   Preprocessing    Pydantic Schema  Validation
     ↓              ↓                 ↓               ↓
File Support   Text Cleaning    Structured Data   CTI Graph
                                      ↓
                              Post-Processing
                                 ↓         ↓
                          Attack Flows  Detection Opps
                                 ↓         ↓
                            Markdown Report Export

πŸ› οΈ Advanced Configuration

Environment Variables

| Variable | Required | Description | Example |
| --- | --- | --- | --- |
| GEMINI_API_KEY | ✅ | Google AI API key | AIza... |
| GEMINI_MODEL | ✅ | Gemini model to use | gemini-2.5-flash-preview-05-20 |

Model Selection Guide

| Model | Speed | Quality | Cost | Use Case |
| --- | --- | --- | --- | --- |
| gemini-2.5-flash-preview-05-20 | ⚡ Fast | 🎯 Good | 💰 Low | Production, bulk processing |
| gemini-2.5-pro-preview-06-05 | 🐌 Slow | 🏆 Excellent | 💸 High | Complex analysis, research |

🔧 Troubleshooting

Common Issues

Missing Environment Variables

Error: Required environment variables missing: GEMINI_API_KEY, GEMINI_MODEL

Solution: Create .env file with both required variables

Unsupported File Format

Skipping unsupported file: document.rtf

Solution: Convert to PDF, DOCX, TXT, or MD format

LLM Processing Errors

Structured Gemini API error: ...

Solutions:

  • Check API key validity
  • Verify model name spelling
  • Try with smaller documents first
  • Use --verbose for detailed error logs

JSON Parsing Issues

The tool automatically handles JSON parsing issues with:

  • Structured output as primary method
  • Manual parsing fallback
  • JSON repair for truncated responses
  • Graceful error recovery
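The "JSON repair for truncated responses" step can be approximated by re-balancing brackets. A simplified sketch of the idea (the tool's actual repair logic may differ and likely handles more cases):

```python
import json

def repair_truncated_json(text):
    """Append the closing brackets a truncated JSON payload is missing.
    Simplified: handles unbalanced {...} and [...], but not a string
    that was cut off mid-value."""
    stack = []
    in_string = False
    escape = False
    for ch in text:
        if escape:
            escape = False
        elif ch == "\\" and in_string:
            escape = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "{[":
                stack.append("}" if ch == "{" else "]")
            elif ch in "}]" and stack:
                stack.pop()
    repaired = text.rstrip()
    if repaired.endswith(","):  # drop a dangling comma before closing
        repaired = repaired[:-1]
    return repaired + "".join(reversed(stack))
```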

Performance Optimization

  • Parallel Processing: Use -j flag for multiple files
  • Model Selection: Use faster models for bulk processing
  • Quiet Mode: Use -q for automation scripts
  • Output Management: Organize with custom output directories

Debug Mode

# Enable verbose logging
python tdo_bulk.py report.pdf -v

# Debug detection opportunities
python tdo_bulk.py report.pdf --opps --debug-opps

# Export detailed logs
python tdo_bulk.py report.pdf -v > processing.log 2>&1

📦 Dependencies

The tool requires Python 3.9+ and these key packages:

  • google-generativeai: Google AI (Gemini) API client with structured output
  • PyMuPDF: High-performance PDF text extraction
  • python-docx: Microsoft Word document processing
  • pandas: Data manipulation and CSV export
  • pydantic: Schema validation and structured output
  • python-dotenv: Environment variable management
  • cleantext: Text preprocessing utilities
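Together these packages support the multi-format dispatch described in the Architecture section. A minimal sketch: the TXT/MD branches use only the standard library, while the PDF and DOCX branches lazily import PyMuPDF (imported as fitz) and python-docx; `extract_text` is an illustrative helper, not the tool's actual function:

```python
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".txt", ".md"}

def extract_text(path):
    """Dispatch on file extension and return the document's plain text."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError("Skipping unsupported file: " + str(path))
    if suffix in {".txt", ".md"}:
        return Path(path).read_text(encoding="utf-8", errors="replace")
    if suffix == ".pdf":
        import fitz  # PyMuPDF
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    import docx  # python-docx
    return "\n".join(p.text for p in docx.Document(path).paragraphs)
```

Deferring the third-party imports into the branches keeps the TXT/MD path usable even where PyMuPDF or python-docx is not installed.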

🚧 Limitations

This standalone version:

  • Does not include Knowledge Graph or Neo4j integration
  • Does not support vector-database features for similarity search
  • Does not provide real-time retrieval augmentation
  • Is simplified compared to the full TDO platform

However, it includes all core CTI extraction and analysis capabilities needed for most use cases.

🔄 Recent Enhancements

  • Environment Configuration: Migrated to .env file management with dotenv
  • Structured Output: Reliable JSON parsing via Google Gemini structured output, with automatic fallback
  • Enhanced Schemas: Comprehensive 12 entity types and 24 relationship types
  • Detection Opportunities: Evidence-based detection rule generation with evaluation
  • Attack Flow Improvements: Enhanced with evidence backing and explicit reasoning
  • No Hardcoded Defaults: All configuration comes from environment variables
  • Robust Error Handling: Automatic fallback parsing and graceful error recovery

🔒 Security Considerations

API Key Protection

  • Never commit your .env file - it's excluded by .gitignore
  • Store GEMINI_API_KEY securely using environment variables
  • Rotate API keys periodically
  • Review SECURITY.md for detailed security guidance

Data Privacy

  • Document text is sent to Google's Gemini API for processing
  • Review Google's AI data usage policies before processing sensitive documents
  • Extracted data may contain sensitive information (IOCs, organization names, file paths)
  • Review outputs before sharing outside your organization

Output Data

Extracted JSON and Markdown files may contain:

  • IP addresses and domain names (infrastructure indicators)
  • File hashes and technical indicators
  • Organization and target information
  • Detailed threat actor and campaign data

Always review outputs before public sharing.

Reporting Security Issues

Please report security vulnerabilities responsibly. See SECURITY.md for details.

📄 License

This project is licensed under the MIT License - see LICENSE for details.

🤝 Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.


Need Help? Run python tdo_bulk.py --help for quick reference or check the troubleshooting section above.
