Skip to content

Pale Fire framework for the integration of LLMs with knowledge graphs

Notifications You must be signed in to change notification settings

4tikhonov/palefire

Repository files navigation

Pale Fire - Intelligent Knowledge Graph Search System

This framework is being developed by Slava Tykhonov and highly experimental.

Named after Vladimir Nabokov's novel "Pale Fire", where a poem becomes the subject of extensive commentary and interpretation—just like how this system builds a rich knowledge graph from text and enables intelligent exploration through questions.

"The novel is presented as a 999-line poem, written by the fictional poet John Shade, with a foreword, lengthy commentary, and index written by Shade's neighbor and academic colleague, Charles Kinbote. Together these elements form a narrative in which both fictional authors are central characters. Pale Fire's unusual structure has attracted much attention, and it is often cited as an important example of metafiction, as well as an analog precursor to hypertext fiction, and a poioumenon."

Example

Pale Fire can transform factually correct and evidence-confirmed data points from research datasets into human-readable descriptions based on annotations created by querying knowledge graphs on event entities and LLM integration of new knowledge into understandable narratives. In opposite, it can turn back any human-readable annotation to factually correct data points and link the provenance information as reference.

Use Case Example: If you have data observations on strikes registered on a specific date and place, Pale Fire can:

  • Query the knowledge graph for related entities (location, date, event type)
  • Retrieve contextual information such as weather conditions and temperature from connected nodes
  • Find evidence from witnesses and related sources
  • Synthesize all this information into a coherent, human-readable narrative

For instance, given a data point like:

Event: Strike
Date: March 15, 2023
Location: San Francisco, CA

Pale Fire can generate a narrative that includes:

  • Historical weather data for that date and location
  • Temperature records and conditions
  • Related witness accounts or news reports
  • Contextual information about similar events
  • Temporal relationships to other events in the knowledge graph

Example Narrative Output:

On March 15, 2023, a labor strike occurred in San Francisco, California. The day was characterized by mild spring weather, with temperatures reaching 62°F (17°C) and partly cloudy conditions—typical for early spring in the Bay Area. According to weather records, the morning began with light fog that cleared by mid-day, providing clear visibility for the demonstration that took place in the city's financial district.

Witness accounts from local news reports indicate that approximately 500 workers gathered outside the headquarters of a major tech company, carrying signs and chanting demands for better working conditions. The strike was part of a broader wave of labor actions that had been occurring across California's tech sector throughout early 2023, following similar events in Los Angeles on March 8th and Oakland on March 12th.

This event was temporally connected to a series of related labor actions: it occurred just one week after a similar strike in Seattle, Washington, and preceded another major demonstration in San Jose scheduled for March 22nd. The knowledge graph reveals that these events were part of a coordinated effort by tech workers' unions across the West Coast, responding to industry-wide concerns about workplace safety and compensation.

Historical context from the knowledge graph shows that San Francisco has a long history of labor activism, with notable strikes occurring in 2018 and 2020. The 2023 strike shares similar characteristics with these previous events, particularly in terms of location (financial district) and participant demographics (tech sector workers).

This transforms raw data points into rich, contextualized stories that are both factually accurate and humanly comprehensible.

Overview

Pale Fire is an advanced knowledge graph search system featuring:

  • 🧠 Question-Type Detection - Automatically understands WHO/WHERE/WHEN/WHAT/WHY/HOW questions
  • 🏷️ NER Enrichment - Extracts and tags 18+ entity types (PER, LOC, ORG, DATE, etc.)
  • 📊 5-Factor Ranking - Combines semantic, connectivity, temporal, query matching, and entity-type intelligence
  • ⚡ CLI Interface - Easy-to-use command-line interface for ingestion and queries
  • 🔧 Modular Architecture - Clean separation of concerns for maintainability
  • 🤖 AI Agent Daemon - Long-running daemon service that keeps Gensim and spaCy models loaded in memory for instant access
  • 🔑 Keyword Extraction - Extract keywords and n-grams (2-4 words) using Gensim with configurable weights (TF-IDF, TextRank, Word Frequency)
  • 📄 File Parsing - Extract text from multiple formats: TXT, CSV, PDF, Excel (.xlsx, .xls), OpenDocument (.ods), URLs/HTML
  • 📚 Theoretical Foundation - Based on Pale Fire's interpretive framework (see docs/PROS-CONS.md)

Quick Start

Docker (Recommended)

# 1. Start all services
docker-compose up -d

# 2. Setup (pull models)
make setup

# 3. Ingest demo data
make ingest-demo

# 4. Run a query
make query

# 5. Access services
# - API: http://localhost:8000
# - API Docs: http://localhost:8000/docs
# - Neo4j: http://localhost:7474

See docs/DOCKER.md for complete Docker documentation.

CLI Usage

# 1. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Install keyword extraction (optional but recommended)
pip install gensim>=4.3.0
# Optional: For better stemming support
pip install nltk

# 2. Configure environment
cp env.example .env  # Edit with your settings

# 3. View configuration
python palefire-cli.py config

# 4. Ingest demo data
python palefire-cli.py ingest --demo

# 5. Ask a question
python palefire-cli.py query "Who was the California Attorney General in 2020?"

API Usage

# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure environment
cp env.example .env  # Edit with your settings

# 3. Start API server
python api.py

# 4. Access API
# - Base URL: http://localhost:8000
# - Interactive docs: http://localhost:8000/docs
# - ReDoc: http://localhost:8000/redoc

Features

Intelligent Question Detection

Automatically detects 8 question types and adjusts entity weights:

# WHO questions → boost person entities 2.0x
python palefire-cli.py query "Who was the Attorney General?"

# WHERE questions → boost location entities 2.0x
python palefire-cli.py query "Where did Kamala Harris work?"

# WHEN questions → boost date entities 2.0x
python palefire-cli.py query "When did Gavin Newsom become governor?"

NER-Enriched Ingestion

Extract entities automatically during ingestion:

# With NER enrichment (recommended)
python palefire-cli.py ingest --file episodes.json

# Without NER (faster)
python palefire-cli.py ingest --file episodes.json --no-ner

Multiple Search Methods

Choose the best method for your query:

# Question-aware (recommended for natural questions)
python palefire-cli.py query "Who is Gavin Newsom?" -m question-aware

# Connection-based (for finding central entities)
python palefire-cli.py query "Important people" -m connection

# Standard (fastest, basic RRF)
python palefire-cli.py query "California" -m standard

CLI Commands

Ingest Episodes

# From file
python palefire-cli.py ingest --file episodes.json

# Demo data
python palefire-cli.py ingest --demo

# Without NER
python palefire-cli.py ingest --file episodes.json --no-ner

Query Knowledge Graph

# Basic query
python palefire-cli.py query "Your question here?"

# With specific method
python palefire-cli.py query "Your question?" --method question-aware

# Export results to JSON
python palefire-cli.py query "Your question?" --export results.json

# Combine method and export
python palefire-cli.py query "Who is X?" -m standard -e output.json

# Short form
python palefire-cli.py query "Who is X?" -m standard

Show Configuration

python palefire-cli.py config

Clean Database

# Clean database (with confirmation prompt)
python palefire-cli.py clean

# Clean without confirmation
python palefire-cli.py clean --confirm

# Delete only nodes (keep database structure)
python palefire-cli.py clean --nodes-only

Extract Keywords

# Extract keywords from text
python palefire-cli.py keywords "Your text here" --num-keywords 10

# With n-grams (2-4 word phrases)
python palefire-cli.py keywords "Your text here" --min-ngram 2 --max-ngram 3

# Using specific method (tfidf, textrank, frequency, combined)
python palefire-cli.py keywords "Your text" --method combined

# Save to file
python palefire-cli.py keywords "Your text" -o results.json

Parse Files

# Auto-detect file type and parse
python palefire-cli.py parse document.pdf

# Parse specific file types
python palefire-cli.py parse-txt document.txt
python palefire-cli.py parse-csv data.csv
python palefire-cli.py parse-pdf document.pdf
python palefire-cli.py parse-spreadsheet data.xlsx
python palefire-cli.py parse-url https://example.com

# Parse with options
python palefire-cli.py parse-csv data.csv --delimiter ";"
python palefire-cli.py parse-pdf document.pdf --max-pages 10
python palefire-cli.py parse-url https://example.com --extract-keywords --keywords-method ner

Manage AI Agent Daemon

# Start daemon in background
python palefire-cli.py agent start --daemon

# Check status
python palefire-cli.py agent status

# Stop daemon
python palefire-cli.py agent stop

# Restart daemon
python palefire-cli.py agent restart --daemon

Get Help

python palefire-cli.py --help
python palefire-cli.py ingest --help
python palefire-cli.py query --help
python palefire-cli.py keywords --help
python palefire-cli.py parse --help
python palefire-cli.py agent --help

Episode File Format

Create a JSON file with your episodes:

[
    {
        "content": "Kamala Harris is the Attorney General of California.",
        "type": "text",
        "description": "Biography"
    },
    {
        "content": {
            "name": "Gavin Newsom",
            "position": "Governor",
            "state": "California"
        },
        "type": "json",
        "description": "Structured data"
    }
]

See example_episodes.json for a complete example.

Architecture

palefire/
├── palefire-cli.py              # Main CLI application
├── modules/                     # Core modules
│   ├── __init__.py
│   ├── PaleFireCore.py         # EntityEnricher + QuestionTypeDetector
│   ├── KeywordBase.py          # Keyword extraction (Gensim)
│   └── api_models.py           # Pydantic models for API
├── agents/                      # AI Agent daemon and parsers
│   ├── AIAgent.py              # ModelManager, AIAgentDaemon
│   ├── palefire-agent-service.py  # Service script
│   ├── parsers/                 # File parsers
│   │   ├── base_parser.py      # Base parser class
│   │   ├── txt_parser.py       # Text file parser
│   │   ├── csv_parser.py       # CSV parser
│   │   ├── pdf_parser.py       # PDF parser
│   │   └── spreadsheet_parser.py  # Excel/ODS parser
│   └── docker-compose.agent.yml  # Docker compose for agent
├── prompts/                     # AI/LLM prompts directory
│   ├── system/                 # System prompts
│   ├── queries/                # Query-related prompts
│   ├── extraction/             # Extraction prompts
│   └── templates/              # Reusable prompt templates
├── examples/                    # Example files for tests
│   ├── input/                  # Test input files
│   └── output/                 # Test output files
├── example_episodes.json        # Example data
├── docs/                        # Documentation folder
│   ├── CLI_GUIDE.md            # Complete CLI documentation
│   ├── QUICK_REFERENCE.md      # Quick reference card
│   ├── ARCHITECTURE.md         # Architecture details
│   └── [other documentation]
└── [other files]

See docs/ARCHITECTURE.md for complete architecture documentation. See prompts/README.md for prompts organization guide.

5-Factor Ranking System

Pale Fire combines 5 independent factors for optimal search results:

  1. Semantic Relevance (30%) - RRF hybrid search (vector + keyword)
  2. Connectivity (15%) - How well-connected in the knowledge graph
  3. Temporal Match (20%) - Active during query time period
  4. Query Term Match (20%) - Explicit matches of query terms
  5. Entity Type Match (15%) - Entity types relevant to question type

Question Types

Type Pattern Boosts Example
WHO who, whom, whose PER (2.0x) "Who was the AG?"
WHERE where, which place LOC (2.0x) "Where did she work?"
WHEN when, what year DATE (2.0x) "When was he governor?"
WHAT (org) what organization ORG (2.0x) "What organization?"
WHAT (position) what position PER/ORG (1.5x) "What position?"
HOW MANY how many CARDINAL (2.0x) "How many years?"
WHY why EVENT (1.5x) "Why did she leave?"
WHAT (event) what happened EVENT (2.0x) "What happened?"

Entity Types

Automatically extracted with NER:

  • PER - Persons (Kamala Harris, Gavin Newsom)
  • LOC - Locations (California, San Francisco)
  • ORG - Organizations (Attorney General, FBI)
  • DATE - Dates (January 3, 2011, 2020)
  • TIME - Times (3:00 PM, morning)
  • MONEY - Money ($1 million)
  • PERCENT - Percentages (50%)
  • EVENT - Events (World War II)
  • Plus 10 more types

Configuration

All configuration is centralized in config.py and loaded from .env:

# Copy example configuration
cp env.example .env

# Edit with your settings
nano .env

# View current configuration
python palefire-cli.py config

Key settings:

# Neo4j (required)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password

# LLM Provider
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_MODEL=deepseek-r1:7b
OLLAMA_VERIFICATION_MODEL=gpt-oss:latest  # Optional: separate model for NER verification

# Search Configuration
DEFAULT_SEARCH_METHOD=question-aware
SEARCH_RESULT_LIMIT=20
SEARCH_TOP_K=5

# Ranking Weights (must sum to ≤ 1.0)
WEIGHT_CONNECTION=0.15
WEIGHT_TEMPORAL=0.20
WEIGHT_QUERY_MATCH=0.20
WEIGHT_ENTITY_TYPE=0.15

See docs/CONFIGURATION.md for complete documentation.

Examples

Example 1: Political Queries

# Ingest
python palefire-cli.py ingest --demo

# Query
python palefire-cli.py query "Who was the California Attorney General in 2020?"
python palefire-cli.py query "Where did Kamala Harris work as DA?"
python palefire-cli.py query "When did Gavin Newsom become governor?"

Example 2: Custom Data

# Create your data file
cat > my_data.json << 'EOF'
[
    {
        "content": "Your content here...",
        "type": "text",
        "description": "Your description"
    }
]
EOF

# Ingest and query
python palefire-cli.py ingest --file my_data.json
python palefire-cli.py query "Your question?"

Example 3: Batch Processing

# Ingest multiple files
for file in data/*.json; do
    python palefire-cli.py ingest --file "$file"
done

# Run multiple queries
python palefire-cli.py query "Question 1?"
python palefire-cli.py query "Question 2?"

Documentation

All documentation is located in the docs/ folder. See docs/README.md for the complete documentation index.

New: Research documentation now available! See docs/PROS-CONS.md for the theoretical framework and docs/EVALUATION.md for evaluation methodology.

Getting Started

API & CLI

Features

Advanced

Research & Theory

Changelog

Testing

Pale Fire includes a comprehensive test suite with 126+ tests covering all major components:

  • Core modules (EntityEnricher, QuestionTypeDetector)
  • AI Agent (ModelManager, AIAgentDaemon) - 47 tests
  • File parsers (TXT, CSV, PDF, Spreadsheet) - 20+ tests
  • API endpoints and models
  • Search functions and ranking algorithms
  • Configuration and utilities
# Run all tests
pytest

# Run with coverage
pytest --cov=. --cov-report=html

# Run specific test suite
pytest tests/test_ai_agent.py -v

# Use test runner script
./run_tests.sh coverage

See:

Requirements

Core Dependencies

  • graphiti-core>=0.3.0 - Knowledge graph framework
  • python-dotenv>=1.0.0 - Environment variable management
  • gensim>=4.3.0 - Keyword extraction (for keywords command)
  • spacy>=3.7.0 - Named Entity Recognition (optional but recommended)
  • fastapi>=0.104.0 - API framework
  • uvicorn[standard]>=0.24.0 - ASGI server
  • pydantic>=2.5.0 - Data validation

Optional Dependencies

  • nltk - For better stemming support in keyword extraction
  • psutil>=5.9.0 - System monitoring for AI Agent daemon
  • PyPDF2>=3.0.0 or pdfplumber>=0.9.0 - PDF parsing
  • openpyxl>=3.1.0 - Excel .xlsx files
  • xlrd>=2.0.0 - Excel .xls files
  • odfpy>=1.4.0 - OpenDocument Spreadsheet (.ods) files

Docker (Recommended)

  • Docker 20.10+
  • Docker Compose 2.0+
  • (Optional) NVIDIA Docker for GPU support

Manual Installation

Core:

  • Python 3.8+
  • graphiti-core
  • python-dotenv
  • Neo4j database
  • gensim>=4.3.0 (for keyword extraction)

NER (Optional but Recommended):

  • spacy
  • en_core_web_sm model

Keyword Extraction (Optional but Recommended):

  • gensim>=4.3.0
  • nltk (for better stemming support)

Testing:

  • pytest
  • pytest-asyncio
  • pytest-cov
  • pytest-mock

Performance

Without AI Agent (models load each time)

Operation Time Notes
Model loading 5-10s One-time per process
Keyword extraction 0.5-1s Per request
Entity extraction (spaCy) 50-500ms Per node
Entity extraction (pattern) 10-50ms Per node
Standard search 100-300ms RRF only
Question-aware search 500-2000ms All factors

With AI Agent (models stay loaded)

Operation Time Notes
Model loading 5-10s One-time on daemon startup
Keyword extraction 0.01-0.1s 10-100x faster!
Entity extraction (spaCy) 50-500ms Same as above
File parsing Varies Depends on file type and size
Standard search 100-300ms RRF only
Question-aware search 500-2000ms All factors

Question Detection

Operation Time Notes
Question detection 1-5ms Regex-based

Troubleshooting

Module not found

cd /path/to/palefire
python palefire-cli.py --help

spaCy model not found

python -m spacy download en_core_web_sm

Gensim not found (for keyword extraction)

pip install gensim>=4.3.0
# Optional: For better stemming support
pip install nltk

File parsing dependencies missing

# Install all parsing dependencies
pip install PyPDF2>=3.0.0 openpyxl>=3.1.0 xlrd>=2.0.0 odfpy>=1.4.0 requests>=2.31.0 beautifulsoup4>=4.12.0

# Or install individually as needed
pip install PyPDF2>=3.0.0  # For PDF files
pip install openpyxl>=3.1.0  # For .xlsx files
pip install requests>=2.31.0 beautifulsoup4>=4.12.0  # For URL/HTML parsing
pip install xlrd>=2.0.0  # For .xls files
pip install odfpy>=1.4.0  # For .ods files

AI Agent daemon not starting

# Check if daemon is already running
python palefire-cli.py agent status

# Check logs
tail -f logs/palefire-agent.log

# Verify dependencies
pip install psutil>=5.9.0

Neo4j connection error

# Check Neo4j is running
# Verify credentials in .env

Best Practices

  1. Use AI Agent Daemon for production - eliminates model loading delays
  2. ✅ Use NER enrichment for production
  3. ✅ Use question-aware search for natural questions
  4. ✅ Batch process large datasets
  5. ✅ Monitor logs for errors
  6. ✅ Backup Neo4j database regularly
  7. Keep daemon running - models stay loaded, requests are instant
  8. Parse files once - reuse parsed text for multiple operations
  9. ✅ Use appropriate parsers - PDF parsers vary in speed (pdfplumber slower but better)

AI Agent Daemon

The AI Agent daemon keeps Gensim and spaCy models loaded in memory to avoid start/stop delays. This is especially useful for production deployments with high request volumes.

Features

  • ⚡ Fast Access: Models stay loaded, eliminating 5-10 second initialization delays
  • 🔄 Thread-Safe: Safe concurrent access to models via ModelManager
  • 📄 File Parsing: Integrated parsers for TXT, CSV, PDF, and spreadsheet files
  • 🔑 Keyword Extraction: Fast keyword and n-gram extraction with configurable methods
  • 🏷️ Entity Extraction: Instant NER extraction using loaded spaCy models
  • 📊 Status Monitoring: Real-time status with process information (PID, memory, CPU)

Quick Start

# Start daemon in background
python palefire-cli.py agent start --daemon

# Check status (shows PID, memory, CPU usage)
python palefire-cli.py agent status

# Stop daemon
python palefire-cli.py agent stop

# Restart daemon
python palefire-cli.py agent restart --daemon

Using the Daemon Programmatically

from agents import get_daemon

# Get daemon instance (models loaded once)
daemon = get_daemon(use_spacy=True)
daemon.model_manager.initialize(use_spacy=True)

# Extract keywords (fast - models already loaded)
keywords = daemon.extract_keywords(
    "Your text here",
    num_keywords=10,
    method='combined',
    enable_ngrams=True,
    min_ngram=2,
    max_ngram=3
)

# Extract entities (fast - models already loaded)
entities = daemon.extract_entities("Your text here")

# Parse files
result = daemon.parse_file("document.pdf")
if result['success']:
    text = result['text']
    metadata = result['metadata']

Automatic Daemon Management

The keywords command automatically checks if the daemon is running and starts it if needed:

# This will start the daemon automatically if not running
python palefire-cli.py keywords "Your text here"

Docker Deployment

Standalone:

# Start the AI Agent daemon
docker-compose -f agents/docker-compose.agent.yml up -d

# View logs
docker-compose -f agents/docker-compose.agent.yml logs -f

# Stop the agent
docker-compose -f agents/docker-compose.agent.yml down

Integrated with main services:

# Start all services including the agent
docker-compose -f docker-compose.yml -f agents/docker-compose.agent.yml up -d

See agents/DOCKER.md for complete Docker documentation.

See agents/USAGE_GUIDE.md for complete usage guide on starting, stopping, and querying the agent.

System Service Integration

Linux (systemd):

# Copy service file
sudo cp agents/palefire-agent.service /etc/systemd/system/
# Edit paths in service file
sudo nano /etc/systemd/system/palefire-agent.service
# Enable and start
sudo systemctl enable palefire-agent
sudo systemctl start palefire-agent

macOS (launchd):

# Copy plist file
cp agents/palefire-agent.plist ~/Library/LaunchAgents/
# Edit paths in plist file
nano ~/Library/LaunchAgents/palefire-agent.plist
# Load service
launchctl load ~/Library/LaunchAgents/palefire-agent.plist

File Parsing Capabilities

The AI Agent includes integrated file parsers for extracting text from various formats:

  • TXT: Plain text files with encoding detection
  • CSV: Comma-separated values with delimiter auto-detection
  • PDF: Text and table extraction (PyPDF2 or pdfplumber)
  • Spreadsheets: Excel (.xlsx, .xls) and OpenDocument (.ods) with multi-sheet support
  • URL/HTML: Extract text from web pages using BeautifulSoup with script/style removal
from agents import get_daemon

daemon = get_daemon()
result = daemon.parse_file("document.pdf", max_pages=10)

# Result structure:
# {
#     'text': 'Full extracted text...',
#     'metadata': {'filename': 'document.pdf', 'page_count': 5, ...},
#     'pages': ['Page 1 text...', 'Page 2 text...'],
#     'tables': [{'data': [...], 'headers': [...]}],
#     'success': True,
#     'error': None
# }

Benefits

  • ⚡ No Model Loading Delays: Models stay in memory, ready for instant use (10-100x faster!)
  • 🔄 Reduced Memory Overhead: Single instance shared across requests
  • 📈 Better Performance: Eliminates repeated model initialization
  • 🏭 Production Ready: Designed for high-throughput scenarios
  • 📄 Unified Interface: Single daemon handles keywords, entities, and file parsing

Future Enhancements

  • REST API wrapper (see docs/API_GUIDE.md)
  • AI Agent daemon for model persistence
  • File parsers (TXT, CSV, PDF, Spreadsheet)
  • Keyword extraction with n-grams
  • Comprehensive unit tests for AI Agent (47+ tests)
  • Web UI
  • Result caching
  • Multi-language support
  • Custom entity types
  • ML-based question detection
  • Socket/HTTP communication for daemon
  • Additional file formats (DOCX, RTF, etc.)

Contributing

When adding features:

  1. Add classes to modules/PaleFireCore.py
  2. Add functions to palefire-cli.py
  3. Update documentation
  4. Test thoroughly

License

Inherits license from parent Open WebUI project.

Support

For issues or questions:

  1. Check documentation files in docs/
  2. Review docs/CLI_GUIDE.md
  3. Check logs for error messages
  4. Verify environment configuration

Pale Fire - Intelligent Knowledge Graph Search Made Easy 🚀

About

Pale Fire framework for the integration of LLMs with knowledge graphs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published