This framework is being developed by Slava Tykhonov and is highly experimental.
Named after Vladimir Nabokov's novel "Pale Fire", in which a poem becomes the subject of extensive commentary and interpretation, much as this system builds a rich knowledge graph from text and enables intelligent exploration through questions.
"The novel is presented as a 999-line poem, written by the fictional poet John Shade, with a foreword, lengthy commentary, and index written by Shade's neighbor and academic colleague, Charles Kinbote. Together these elements form a narrative in which both fictional authors are central characters. Pale Fire's unusual structure has attracted much attention, and it is often cited as an important example of metafiction, as well as an analog precursor to hypertext fiction, and a poioumenon."
Pale Fire transforms factually correct, evidence-backed data points from research datasets into human-readable descriptions. It does this by querying knowledge graphs for event entities, creating annotations, and using an LLM to integrate the new knowledge into understandable narratives. Conversely, it can convert any human-readable annotation back into factually correct data points and link the provenance information as a reference.
Use Case Example: If you have data observations on strikes registered on a specific date and place, Pale Fire can:
- Query the knowledge graph for related entities (location, date, event type)
- Retrieve contextual information such as weather conditions and temperature from connected nodes
- Find evidence from witnesses and related sources
- Synthesize all this information into a coherent, human-readable narrative
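As a rough illustration of this pipeline, the sketch below enriches a data point from a toy graph and renders it as a sentence. All function and field names here are hypothetical, not Pale Fire's real API:

```python
# Hypothetical sketch of the data-point -> narrative pipeline.
# Names are illustrative only; the real system queries a Neo4j graph
# and uses an LLM to compose the narrative.

def enrich_data_point(event, graph):
    """Attach contextual facts connected to the event in a toy graph."""
    key = (event["location"], event["date"])
    return {**event, "context": graph.get(key, {})}

def to_narrative(enriched):
    """Render the enriched record as a human-readable sentence."""
    parts = [f"On {enriched['date']}, a {enriched['event'].lower()} "
             f"occurred in {enriched['location']}."]
    weather = enriched["context"].get("weather")
    if weather:
        parts.append(f"Weather records note {weather}.")
    return " ".join(parts)

# Toy knowledge graph keyed by (location, date)
graph = {("San Francisco, CA", "March 15, 2023"):
         {"weather": "62°F and partly cloudy"}}
event = {"event": "Strike", "date": "March 15, 2023",
         "location": "San Francisco, CA"}
print(to_narrative(enrich_data_point(event, graph)))
```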
For instance, given a data point like:
Event: Strike
Date: March 15, 2023
Location: San Francisco, CA
Pale Fire can generate a narrative that includes:
- Historical weather data for that date and location
- Temperature records and conditions
- Related witness accounts or news reports
- Contextual information about similar events
- Temporal relationships to other events in the knowledge graph
Example Narrative Output:
On March 15, 2023, a labor strike occurred in San Francisco, California. The day was characterized by mild spring weather, with temperatures reaching 62°F (17°C) and partly cloudy conditions—typical for early spring in the Bay Area. According to weather records, the morning began with light fog that cleared by mid-day, providing clear visibility for the demonstration that took place in the city's financial district.
Witness accounts from local news reports indicate that approximately 500 workers gathered outside the headquarters of a major tech company, carrying signs and chanting demands for better working conditions. The strike was part of a broader wave of labor actions that had been occurring across California's tech sector throughout early 2023, following similar events in Los Angeles on March 8th and Oakland on March 12th.
This event was temporally connected to a series of related labor actions: it occurred just one week after a similar strike in Seattle, Washington, and preceded another major demonstration in San Jose scheduled for March 22nd. The knowledge graph reveals that these events were part of a coordinated effort by tech workers' unions across the West Coast, responding to industry-wide concerns about workplace safety and compensation.
Historical context from the knowledge graph shows that San Francisco has a long history of labor activism, with notable strikes occurring in 2018 and 2020. The 2023 strike shares similar characteristics with these previous events, particularly in terms of location (financial district) and participant demographics (tech sector workers).
This transforms raw data points into rich, contextualized stories that are both factually accurate and humanly comprehensible.
Pale Fire is an advanced knowledge graph search system featuring:
- 🧠 Question-Type Detection - Automatically understands WHO/WHERE/WHEN/WHAT/WHY/HOW questions
- 🏷️ NER Enrichment - Extracts and tags 18+ entity types (PER, LOC, ORG, DATE, etc.)
- 📊 5-Factor Ranking - Combines semantic, connectivity, temporal, query matching, and entity-type intelligence
- ⚡ CLI Interface - Easy-to-use command-line interface for ingestion and queries
- 🔧 Modular Architecture - Clean separation of concerns for maintainability
- 🤖 AI Agent Daemon - Long-running daemon service that keeps Gensim and spaCy models loaded in memory for instant access
- 🔑 Keyword Extraction - Extract keywords and n-grams (2-4 words) using Gensim with configurable weights (TF-IDF, TextRank, Word Frequency)
- 📄 File Parsing - Extract text from multiple formats: TXT, CSV, PDF, Excel (.xlsx, .xls), OpenDocument (.ods), URLs/HTML
- 📚 Theoretical Foundation - Based on Pale Fire's interpretive framework (see docs/PROS-CONS.md)
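As a rough illustration of the TF-IDF component of keyword extraction (plain Python, not the Gensim implementation the system actually uses):

```python
# Toy TF-IDF keyword scorer; the real system uses Gensim with
# configurable TF-IDF / TextRank / word-frequency weights.
import math
from collections import Counter

def tfidf_keywords(doc, corpus, top_k=5):
    """Rank words in `doc` by term frequency weighted by corpus rarity."""
    tokens = doc.lower().split()
    tf = Counter(tokens)
    n_docs = len(corpus)

    def idf(word):
        df = sum(1 for d in corpus if word in d.lower().split())
        return math.log((1 + n_docs) / (1 + df)) + 1.0

    scores = {w: (c / len(tokens)) * idf(w) for w, c in tf.items()}
    ranked = sorted(scores.items(), key=lambda x: -x[1])
    return [w for w, _ in ranked[:top_k]]

corpus = ["the strike began downtown",
          "the weather was mild",
          "workers joined the strike"]
print(tfidf_keywords("workers on strike demanded better conditions",
                     corpus, top_k=3))
```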
# 1. Start all services
docker-compose up -d
# 2. Setup (pull models)
make setup
# 3. Ingest demo data
make ingest-demo
# 4. Run a query
make query
# 5. Access services
# - API: http://localhost:8000
# - API Docs: http://localhost:8000/docs
# - Neo4j: http://localhost:7474

See docs/DOCKER.md for complete Docker documentation.
# 1. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm
# Install keyword extraction (optional but recommended)
pip install gensim>=4.3.0
# Optional: For better stemming support
pip install nltk
# 2. Configure environment
cp env.example .env # Edit with your settings
# 3. View configuration
python palefire-cli.py config
# 4. Ingest demo data
python palefire-cli.py ingest --demo
# 5. Ask a question
python palefire-cli.py query "Who was the California Attorney General in 2020?"

# 1. Install dependencies
pip install -r requirements.txt
# 2. Configure environment
cp env.example .env # Edit with your settings
# 3. Start API server
python api.py
# 4. Access API
# - Base URL: http://localhost:8000
# - Interactive docs: http://localhost:8000/docs
# - ReDoc: http://localhost:8000/redoc

Automatically detects 8 question types and adjusts entity weights:
# WHO questions → boost person entities 2.0x
python palefire-cli.py query "Who was the Attorney General?"
# WHERE questions → boost location entities 2.0x
python palefire-cli.py query "Where did Kamala Harris work?"
# WHEN questions → boost date entities 2.0x
python palefire-cli.py query "When did Gavin Newsom become governor?"

Extract entities automatically during ingestion:
# With NER enrichment (recommended)
python palefire-cli.py ingest --file episodes.json
# Without NER (faster)
python palefire-cli.py ingest --file episodes.json --no-ner

Choose the best method for your query:
# Question-aware (recommended for natural questions)
python palefire-cli.py query "Who is Gavin Newsom?" -m question-aware
# Connection-based (for finding central entities)
python palefire-cli.py query "Important people" -m connection
# Standard (fastest, basic RRF)
python palefire-cli.py query "California" -m standard

# From file
python palefire-cli.py ingest --file episodes.json
# Demo data
python palefire-cli.py ingest --demo
# Without NER
python palefire-cli.py ingest --file episodes.json --no-ner

# Basic query
python palefire-cli.py query "Your question here?"
# With specific method
python palefire-cli.py query "Your question?" --method question-aware
# Export results to JSON
python palefire-cli.py query "Your question?" --export results.json
# Combine method and export
python palefire-cli.py query "Who is X?" -m standard -e output.json
# Short form
python palefire-cli.py query "Who is X?" -m standard

python palefire-cli.py config

# Clean database (with confirmation prompt)
python palefire-cli.py clean
# Clean without confirmation
python palefire-cli.py clean --confirm
# Delete only nodes (keep database structure)
python palefire-cli.py clean --nodes-only

# Extract keywords from text
python palefire-cli.py keywords "Your text here" --num-keywords 10
# With n-grams (2-4 word phrases)
python palefire-cli.py keywords "Your text here" --min-ngram 2 --max-ngram 3
# Using specific method (tfidf, textrank, frequency, combined)
python palefire-cli.py keywords "Your text" --method combined
# Save to file
python palefire-cli.py keywords "Your text" -o results.json

# Auto-detect file type and parse
python palefire-cli.py parse document.pdf
# Parse specific file types
python palefire-cli.py parse-txt document.txt
python palefire-cli.py parse-csv data.csv
python palefire-cli.py parse-pdf document.pdf
python palefire-cli.py parse-spreadsheet data.xlsx
python palefire-cli.py parse-url https://example.com
# Parse with options
python palefire-cli.py parse-csv data.csv --delimiter ";"
python palefire-cli.py parse-pdf document.pdf --max-pages 10
python palefire-cli.py parse-url https://example.com --extract-keywords --keywords-method ner

# Start daemon in background
python palefire-cli.py agent start --daemon
# Check status
python palefire-cli.py agent status
# Stop daemon
python palefire-cli.py agent stop
# Restart daemon
python palefire-cli.py agent restart --daemon

python palefire-cli.py --help
python palefire-cli.py ingest --help
python palefire-cli.py query --help
python palefire-cli.py keywords --help
python palefire-cli.py parse --help
python palefire-cli.py agent --help

Create a JSON file with your episodes:
[
{
"content": "Kamala Harris is the Attorney General of California.",
"type": "text",
"description": "Biography"
},
{
"content": {
"name": "Gavin Newsom",
"position": "Governor",
"state": "California"
},
"type": "json",
"description": "Structured data"
}
]

See example_episodes.json for a complete example.
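Before ingesting, it can help to sanity-check the file shape. The helper below is illustrative (the real CLI does its own validation):

```python
# Sketch: validate the episode file shape before ingestion.
# This helper is hypothetical, not part of the Pale Fire CLI.
import json

REQUIRED_KEYS = {"content", "type"}

def validate_episodes(raw):
    """Parse the JSON and check that each episode has the required keys."""
    episodes = json.loads(raw)
    if not isinstance(episodes, list):
        raise ValueError("episode file must contain a JSON array")
    for i, ep in enumerate(episodes):
        missing = REQUIRED_KEYS - set(ep.keys())
        if missing:
            raise ValueError(f"episode {i} missing keys: {sorted(missing)}")
    return episodes

raw = '[{"content": "Kamala Harris is the Attorney General of California.", "type": "text", "description": "Biography"}]'
episodes = validate_episodes(raw)
print(len(episodes))  # → 1
```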
palefire/
├── palefire-cli.py # Main CLI application
├── modules/ # Core modules
│ ├── __init__.py
│ ├── PaleFireCore.py # EntityEnricher + QuestionTypeDetector
│ ├── KeywordBase.py # Keyword extraction (Gensim)
│ └── api_models.py # Pydantic models for API
├── agents/ # AI Agent daemon and parsers
│ ├── AIAgent.py # ModelManager, AIAgentDaemon
│ ├── palefire-agent-service.py # Service script
│ ├── parsers/ # File parsers
│ │ ├── base_parser.py # Base parser class
│ │ ├── txt_parser.py # Text file parser
│ │ ├── csv_parser.py # CSV parser
│ │ ├── pdf_parser.py # PDF parser
│ │ └── spreadsheet_parser.py # Excel/ODS parser
│ └── docker-compose.agent.yml # Docker compose for agent
├── prompts/ # AI/LLM prompts directory
│ ├── system/ # System prompts
│ ├── queries/ # Query-related prompts
│ ├── extraction/ # Extraction prompts
│ └── templates/ # Reusable prompt templates
├── examples/ # Example files for tests
│ ├── input/ # Test input files
│ └── output/ # Test output files
├── example_episodes.json # Example data
├── docs/ # Documentation folder
│ ├── CLI_GUIDE.md # Complete CLI documentation
│ ├── QUICK_REFERENCE.md # Quick reference card
│ ├── ARCHITECTURE.md # Architecture details
│ └── [other documentation]
└── [other files]
See docs/ARCHITECTURE.md for complete architecture documentation. See prompts/README.md for prompts organization guide.
Pale Fire combines 5 independent factors for optimal search results:
- Semantic Relevance (30%) - RRF hybrid search (vector + keyword)
- Connectivity (15%) - How well-connected in the knowledge graph
- Temporal Match (20%) - Active during query time period
- Query Term Match (20%) - Explicit matches of query terms
- Entity Type Match (15%) - Entity types relevant to question type
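In spirit, the final score is a weighted sum of per-factor scores. The sketch below uses the documented weights with illustrative factor values in [0, 1] (not the actual implementation):

```python
# Sketch of combining the five ranking factors with the documented
# weights; the per-factor scores here are illustrative inputs.
WEIGHTS = {
    "semantic": 0.30,      # RRF hybrid search (vector + keyword)
    "connectivity": 0.15,  # graph connectedness
    "temporal": 0.20,      # active during the query time period
    "query_match": 0.20,   # explicit query-term matches
    "entity_type": 0.15,   # entity types matching the question type
}

def combined_score(factors):
    """Weighted sum of per-factor scores, each assumed in [0, 1]."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

scores = {"semantic": 0.8, "connectivity": 0.5, "temporal": 1.0,
          "query_match": 0.6, "entity_type": 1.0}
print(round(combined_score(scores), 3))  # → 0.785
```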
| Type | Pattern | Boosts | Example |
|---|---|---|---|
| WHO | who, whom, whose | PER (2.0x) | "Who was the AG?" |
| WHERE | where, which place | LOC (2.0x) | "Where did she work?" |
| WHEN | when, what year | DATE (2.0x) | "When was he governor?" |
| WHAT (org) | what organization | ORG (2.0x) | "What organization?" |
| WHAT (position) | what position | PER/ORG (1.5x) | "What position?" |
| HOW MANY | how many | CARDINAL (2.0x) | "How many years?" |
| WHY | why | EVENT (1.5x) | "Why did she leave?" |
| WHAT (event) | what happened | EVENT (2.0x) | "What happened?" |
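The table above can be read as an ordered list of regex rules mapping a question to entity-type boosts. A toy sketch (patterns and weights are illustrative, not the exact implementation):

```python
# Sketch of regex-based question-type detection with entity boosts,
# mirroring the table above. Rules and weights are illustrative.
import re

RULES = [
    (r"^\s*who\b",             {"PER": 2.0}),
    (r"^\s*where\b",           {"LOC": 2.0}),
    (r"^\s*when\b",            {"DATE": 2.0}),
    (r"\bhow many\b",          {"CARDINAL": 2.0}),
    (r"\bwhat organization\b", {"ORG": 2.0}),
    (r"^\s*why\b",             {"EVENT": 1.5}),
]

def detect_boosts(question):
    """Return entity-type boost factors from the first matching rule."""
    q = question.lower()
    for pattern, boosts in RULES:
        if re.search(pattern, q):
            return boosts
    return {}  # no special boosts for unrecognized questions

print(detect_boosts("Who was the Attorney General?"))    # → {'PER': 2.0}
print(detect_boosts("When did Newsom become governor?")) # → {'DATE': 2.0}
```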
Automatically extracted with NER:
- PER - Persons (Kamala Harris, Gavin Newsom)
- LOC - Locations (California, San Francisco)
- ORG - Organizations (Attorney General, FBI)
- DATE - Dates (January 3, 2011, 2020)
- TIME - Times (3:00 PM, morning)
- MONEY - Money ($1 million)
- PERCENT - Percentages (50%)
- EVENT - Events (World War II)
- Plus 10 more types
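Alongside spaCy, some of these types can be tagged by a lightweight pattern-based fallback (the performance notes below mention a faster pattern path). A toy sketch, with patterns that are illustrative and far from complete:

```python
# Toy pattern-based entity tagger; the full system uses spaCy NER.
# These regexes cover only a few easy cases of each type.
import re

PATTERNS = {
    "DATE": r"\b(?:January|February|March|April|May|June|July|August|"
            r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b",
    "MONEY": r"\$\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion))?",
    "PERCENT": r"\b\d+(?:\.\d+)?%",
}

def pattern_entities(text):
    """Return (label, span) pairs for every pattern match in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            found.append((label, m.group()))
    return found

print(pattern_entities(
    "The $1 million settlement on March 15, 2023 cut costs 50%."))
```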
All configuration is centralized in config.py and loaded from .env:
# Copy example configuration
cp env.example .env
# Edit with your settings
nano .env
# View current configuration
python palefire-cli.py config

Key settings:
# Neo4j (required)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
# LLM Provider
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_MODEL=deepseek-r1:7b
OLLAMA_VERIFICATION_MODEL=gpt-oss:latest # Optional: separate model for NER verification
# Search Configuration
DEFAULT_SEARCH_METHOD=question-aware
SEARCH_RESULT_LIMIT=20
SEARCH_TOP_K=5
# Ranking Weights (must sum to ≤ 1.0)
WEIGHT_CONNECTION=0.15
WEIGHT_TEMPORAL=0.20
WEIGHT_QUERY_MATCH=0.20
WEIGHT_ENTITY_TYPE=0.15

See docs/CONFIGURATION.md for complete documentation.
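A sketch of how these weights might be read and sanity-checked at startup (the real config.py loads .env via python-dotenv; this toy version reads os.environ directly and uses the documented defaults):

```python
# Sketch: load the tunable ranking weights from the environment and
# enforce the documented constraint that they sum to <= 1.0.
import os

def load_weights():
    """Read the four tunable weights, falling back to documented defaults."""
    defaults = {
        "WEIGHT_CONNECTION": 0.15,
        "WEIGHT_TEMPORAL": 0.20,
        "WEIGHT_QUERY_MATCH": 0.20,
        "WEIGHT_ENTITY_TYPE": 0.15,
    }
    weights = {k: float(os.environ.get(k, str(v)))
               for k, v in defaults.items()}
    if sum(weights.values()) > 1.0:
        raise ValueError("ranking weights must sum to <= 1.0")
    return weights

print(load_weights())
```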
# Ingest
python palefire-cli.py ingest --demo
# Query
python palefire-cli.py query "Who was the California Attorney General in 2020?"
python palefire-cli.py query "Where did Kamala Harris work as DA?"
python palefire-cli.py query "When did Gavin Newsom become governor?"

# Create your data file
cat > my_data.json << 'EOF'
[
{
"content": "Your content here...",
"type": "text",
"description": "Your description"
}
]
EOF
# Ingest and query
python palefire-cli.py ingest --file my_data.json
python palefire-cli.py query "Your question?"

# Ingest multiple files
for file in data/*.json; do
python palefire-cli.py ingest --file "$file"
done
# Run multiple queries
python palefire-cli.py query "Question 1?"
python palefire-cli.py query "Question 2?"

All documentation is located in the docs/ folder. See docs/README.md for the complete documentation index.
New: Research documentation now available! See docs/PROS-CONS.md for the theoretical framework and docs/EVALUATION.md for evaluation methodology.
- docs/DOCKER.md - Docker deployment guide (recommended)
- docs/PALEFIRE_SETUP.md - Manual setup instructions
- docs/QUICK_REFERENCE.md - Quick reference card
- docs/CONFIGURATION.md - Complete configuration guide
- docs/API_GUIDE.md - REST API documentation
- docs/CLI_GUIDE.md - Complete CLI documentation
- docs/RANKING_SYSTEM.md - 5-factor ranking system
- docs/NER_ENRICHMENT.md - NER system guide
- docs/QUESTION_TYPE_DETECTION.md - Question-type detection
- docs/QUERY_MATCH_SCORING.md - Query matching details
- docs/ARCHITECTURE.md - Architecture details
- docs/REFACTORING_UTILS.md - Code organization and utils refactoring
- docs/TESTING.md - Testing guide and best practices
- docs/DATABASE_CLEANUP.md - Database cleanup guide
- docs/EXPORT_FEATURE.md - JSON export feature
- docs/ENTITY_TYPES_UPDATE.md - Entity types in connections
- docs/PROS-CONS.md - Pale Fire framework for dataset representation
- docs/EVALUATION.md - Evaluation framework for interpretive AI systems
- docs/CHANGELOG_CONFIG.md - Configuration migration
- docs/MIGRATION_SUMMARY.md - Migration summary
- docs/EXPORT_CHANGES.md - Export format changes
Pale Fire includes a comprehensive test suite with 126+ tests covering all major components:
- Core modules (EntityEnricher, QuestionTypeDetector)
- AI Agent (ModelManager, AIAgentDaemon) - 47 tests
- File parsers (TXT, CSV, PDF, Spreadsheet) - 20+ tests
- API endpoints and models
- Search functions and ranking algorithms
- Configuration and utilities
# Run all tests
pytest
# Run with coverage
pytest --cov=. --cov-report=html
# Run specific test suite
pytest tests/test_ai_agent.py -v
# Use test runner script
./run_tests.sh coverage

See:
- TESTING_SUMMARY.md - Quick test overview
- docs/TESTING.md - Complete testing guide
- tests/README.md - Test directory reference
- graphiti-core>=0.3.0 - Knowledge graph framework
- python-dotenv>=1.0.0 - Environment variable management
- gensim>=4.3.0 - Keyword extraction (for the keywords command)
- spacy>=3.7.0 - Named Entity Recognition (optional but recommended)
- fastapi>=0.104.0 - API framework
- uvicorn[standard]>=0.24.0 - ASGI server
- pydantic>=2.5.0 - Data validation
- nltk - Better stemming support in keyword extraction
- psutil>=5.9.0 - System monitoring for the AI Agent daemon
- PyPDF2>=3.0.0 or pdfplumber>=0.9.0 - PDF parsing
- openpyxl>=3.1.0 - Excel .xlsx files
- xlrd>=2.0.0 - Excel .xls files
- odfpy>=1.4.0 - OpenDocument Spreadsheet (.ods) files
- Docker 20.10+
- Docker Compose 2.0+
- (Optional) NVIDIA Docker for GPU support
Core:
- Python 3.8+
- graphiti-core
- python-dotenv
- Neo4j database
- gensim>=4.3.0 (for keyword extraction)
NER (Optional but Recommended):
- spacy
- en_core_web_sm model
Keyword Extraction (Optional but Recommended):
- gensim>=4.3.0
- nltk (for better stemming support)
Testing:
- pytest
- pytest-asyncio
- pytest-cov
- pytest-mock
| Operation | Time | Notes |
|---|---|---|
| Model loading | 5-10s | One-time per process |
| Keyword extraction | 0.5-1s | Per request |
| Entity extraction (spaCy) | 50-500ms | Per node |
| Entity extraction (pattern) | 10-50ms | Per node |
| Standard search | 100-300ms | RRF only |
| Question-aware search | 500-2000ms | All factors |
| Operation | Time | Notes |
|---|---|---|
| Model loading | 5-10s | One-time on daemon startup |
| Keyword extraction | 0.01-0.1s | 10-100x faster! |
| Entity extraction (spaCy) | 50-500ms | Same as above |
| File parsing | Varies | Depends on file type and size |
| Standard search | 100-300ms | RRF only |
| Question-aware search | 500-2000ms | All factors |
| Operation | Time | Notes |
|---|---|---|
| Question detection | 1-5ms | Regex-based |
cd /path/to/palefire
python palefire-cli.py --help

python -m spacy download en_core_web_sm

pip install gensim>=4.3.0
# Optional: For better stemming support
pip install nltk

# Install all parsing dependencies
pip install PyPDF2>=3.0.0 openpyxl>=3.1.0 xlrd>=2.0.0 odfpy>=1.4.0 requests>=2.31.0 beautifulsoup4>=4.12.0
# Or install individually as needed
pip install PyPDF2>=3.0.0 # For PDF files
pip install openpyxl>=3.1.0 # For .xlsx files
pip install requests>=2.31.0 beautifulsoup4>=4.12.0 # For URL/HTML parsing
pip install xlrd>=2.0.0 # For .xls files
pip install odfpy>=1.4.0 # For .ods files

# Check if daemon is already running
python palefire-cli.py agent status
# Check logs
tail -f logs/palefire-agent.log
# Verify dependencies
pip install psutil>=5.9.0

# Check Neo4j is running
# Verify credentials in .env

- ✅ Use AI Agent Daemon for production - eliminates model loading delays
- ✅ Use NER enrichment for production
- ✅ Use question-aware search for natural questions
- ✅ Batch process large datasets
- ✅ Monitor logs for errors
- ✅ Backup Neo4j database regularly
- ✅ Keep daemon running - models stay loaded, requests are instant
- ✅ Parse files once - reuse parsed text for multiple operations
- ✅ Use appropriate parsers - PDF parsers vary in speed (pdfplumber is slower but more accurate)
The AI Agent daemon keeps Gensim and spaCy models loaded in memory to avoid start/stop delays. This is especially useful for production deployments with high request volumes.
- ⚡ Fast Access: Models stay loaded, eliminating 5-10 second initialization delays
- 🔄 Thread-Safe: Safe concurrent access to models via ModelManager
- 📄 File Parsing: Integrated parsers for TXT, CSV, PDF, and spreadsheet files
- 🔑 Keyword Extraction: Fast keyword and n-gram extraction with configurable methods
- 🏷️ Entity Extraction: Instant NER extraction using loaded spaCy models
- 📊 Status Monitoring: Real-time status with process information (PID, memory, CPU)
# Start daemon in background
python palefire-cli.py agent start --daemon
# Check status (shows PID, memory, CPU usage)
python palefire-cli.py agent status
# Stop daemon
python palefire-cli.py agent stop
# Restart daemon
python palefire-cli.py agent restart --daemon

from agents import get_daemon
# Get daemon instance (models loaded once)
daemon = get_daemon(use_spacy=True)
daemon.model_manager.initialize(use_spacy=True)
# Extract keywords (fast - models already loaded)
keywords = daemon.extract_keywords(
"Your text here",
num_keywords=10,
method='combined',
enable_ngrams=True,
min_ngram=2,
max_ngram=3
)
# Extract entities (fast - models already loaded)
entities = daemon.extract_entities("Your text here")
# Parse files
result = daemon.parse_file("document.pdf")
if result['success']:
text = result['text']
    metadata = result['metadata']

The keywords command automatically checks if the daemon is running and starts it if needed:
# This will start the daemon automatically if not running
python palefire-cli.py keywords "Your text here"

Standalone:
# Start the AI Agent daemon
docker-compose -f agents/docker-compose.agent.yml up -d
# View logs
docker-compose -f agents/docker-compose.agent.yml logs -f
# Stop the agent
docker-compose -f agents/docker-compose.agent.yml down

Integrated with main services:
# Start all services including the agent
docker-compose -f docker-compose.yml -f agents/docker-compose.agent.yml up -d

See agents/DOCKER.md for complete Docker documentation.
See agents/USAGE_GUIDE.md for complete usage guide on starting, stopping, and querying the agent.
Linux (systemd):
# Copy service file
sudo cp agents/palefire-agent.service /etc/systemd/system/
# Edit paths in service file
sudo nano /etc/systemd/system/palefire-agent.service
# Enable and start
sudo systemctl enable palefire-agent
sudo systemctl start palefire-agent

macOS (launchd):
# Copy plist file
cp agents/palefire-agent.plist ~/Library/LaunchAgents/
# Edit paths in plist file
nano ~/Library/LaunchAgents/palefire-agent.plist
# Load service
launchctl load ~/Library/LaunchAgents/palefire-agent.plist

The AI Agent includes integrated file parsers for extracting text from various formats:
- TXT: Plain text files with encoding detection
- CSV: Comma-separated values with delimiter auto-detection
- PDF: Text and table extraction (PyPDF2 or pdfplumber)
- Spreadsheets: Excel (.xlsx, .xls) and OpenDocument (.ods) with multi-sheet support
- URL/HTML: Extract text from web pages using BeautifulSoup with script/style removal
from agents import get_daemon
daemon = get_daemon()
result = daemon.parse_file("document.pdf", max_pages=10)
# Result structure:
# {
# 'text': 'Full extracted text...',
# 'metadata': {'filename': 'document.pdf', 'page_count': 5, ...},
# 'pages': ['Page 1 text...', 'Page 2 text...'],
# 'tables': [{'data': [...], 'headers': [...]}],
# 'success': True,
# 'error': None
# }

- ⚡ No Model Loading Delays: Models stay in memory, ready for instant use (10-100x faster!)
- 🔄 Reduced Memory Overhead: Single instance shared across requests
- 📈 Better Performance: Eliminates repeated model initialization
- 🏭 Production Ready: Designed for high-throughput scenarios
- 📄 Unified Interface: Single daemon handles keywords, entities, and file parsing
- REST API wrapper (see docs/API_GUIDE.md)
- AI Agent daemon for model persistence
- File parsers (TXT, CSV, PDF, Spreadsheet)
- Keyword extraction with n-grams
- Comprehensive unit tests for AI Agent (47+ tests)
- Web UI
- Result caching
- Multi-language support
- Custom entity types
- ML-based question detection
- Socket/HTTP communication for daemon
- Additional file formats (DOCX, RTF, etc.)
When adding features:
- Add classes to modules/PaleFireCore.py
- Add functions to palefire-cli.py
- Update documentation
- Test thoroughly
Inherits its license from the parent Open WebUI project.
For issues or questions:
- Check documentation files in docs/
- Review docs/CLI_GUIDE.md
- Check logs for error messages
- Verify environment configuration
Pale Fire - Intelligent Knowledge Graph Search Made Easy 🚀