Commit f82cc6a

Add basic knowledge extraction and query

1 parent 8dbb402 commit f82cc6a

File tree

21 files changed: +2620 −37 lines

.gitignore

Lines changed: 1 addition & 0 deletions

@@ -34,6 +34,7 @@ venv/
 env/
 python/.venv/
 .python-version
+llm_config.yaml

 # IDE & tooling
 .vscode/

README.md

Lines changed: 62 additions & 1 deletion
@@ -5,7 +5,9 @@ Lance Graph is a Cypher-capable graph query engine built in Rust with Python bin
 This repository contains:

 - `rust/lance-graph` – the Cypher-capable query engine implemented in Rust
-- `python/` – PyO3 bindings and a thin `lance_graph` Python package
+- `python/` – PyO3 bindings and Python packages:
+  - `lance_graph` – thin wrapper around the Rust query engine
+  - `knowledge_graph` – Lance-backed knowledge graph CLI, API, and utilities

 ## Prerequisites

@@ -62,6 +64,65 @@ result = query.execute({"Person": people})
 print(result.to_pydict())  # {'name': ['Bob', 'David'], 'age': [34, 42]}
 ```

+## Knowledge Graph CLI & API
+
+The `knowledge_graph` package layers a simple Lance-backed knowledge graph
+service on top of the `lance_graph` engine. It provides:
+
+- A CLI (`knowledge_graph.main`) for initializing storage, running Cypher
+  queries, and bootstrapping data via heuristic text extraction.
+- A reusable FastAPI component, plus a standalone web service
+  (`knowledge_graph.webservice`) that exposes query and dataset endpoints.
+- Storage helpers that persist node and relationship tables as Lance datasets.
+
+### CLI usage
+
+```bash
+uv run knowledge_graph --init            # initialize storage and schema stub
+uv run knowledge_graph --list-datasets   # list Lance datasets on disk
+uv run knowledge_graph --extract-preview notes.txt
+uv run knowledge_graph --extract-preview "Alice joined the graph team"
+uv run knowledge_graph --extract-and-add notes.txt
+uv run knowledge_graph "MATCH (n) RETURN n LIMIT 5"
+uv run knowledge_graph --log-level DEBUG --extract-preview "Inline text"
+uv run knowledge_graph --ask "Who is working on the Presto project?"
+
+# Configure LLM extraction (default)
+uv sync --extra llm             # install optional LLM dependencies
+uv sync --extra lance-storage   # install Lance dataset support
+export OPENAI_API_KEY=sk-...
+uv run knowledge_graph --llm-model gpt-4o-mini --extract-preview notes.txt
+
+# Supply additional OpenAI client options via YAML (base_url, headers, etc.)
+uv run knowledge_graph --llm-config llm_config.yaml --extract-and-add notes.txt
+
+# Fall back to the heuristic extractor when LLM access is unavailable
+uv run knowledge_graph --extractor heuristic --extract-preview notes.txt
+```
+
+The default extractor uses OpenAI. Configure credentials via environment
+variables supported by the SDK (for example `OPENAI_API_BASE` or
+`OPENAI_API_KEY`), or place them in a YAML file passed through `--llm-config`.
+Override the model and temperature with `--llm-model` and `--llm-temperature`.
+
+By default the CLI writes datasets under `./knowledge_graph_data`. Provide
+`--root` and `--schema` to point at alternate storage locations and schema YAML.
+
+### FastAPI service
+
+Run the web service after installing the `knowledge_graph` package (and
+dependencies such as FastAPI):
+
+```bash
+uv run --package knowledge_graph knowledge_graph-webservice
+```
+
+The service exposes endpoints under `/graph`, including `/graph/health`,
+`/graph/query`, `/graph/datasets`, and `/graph/schema`.
+
 ### Development workflow

 For linting and type checks:

python/README.md

Lines changed: 8 additions & 0 deletions
@@ -43,6 +43,14 @@ maturin develop

 - `python/src/` – PyO3 bridge that exposes graph APIs to Python
 - `python/python/lance_graph/` – pure-Python wrapper and `__init__`
+- `python/python/knowledge_graph/` – CLI, FastAPI, and extractor utilities built on Lance
 - `python/python/tests/` – graph-centric functional tests

 Refer to the repository root `README.md` for information about the Rust crate.
+
+> Run CLI commands through `uv run knowledge_graph ...`. The default uses an
+> LLM-backed extractor; install the LLM extra with `uv sync --extra llm` (or
+> `uv pip install -e '.[llm]'`) and configure `OPENAI_API_KEY`. Run
+> `uv sync --extra lance-storage` to enable Lance dataset persistence. Supply
+> extra options (e.g., `base_url`, HTTP headers) via `--llm-config`. Use
+> `--extractor heuristic` to avoid LLM calls during testing or offline work.

python/llm_config.example.yaml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# Example LLM client configuration (public-safe template)
+# Copy this file to llm_config.yaml and fill in your values.
+api_key: YOUR_API_KEY_HERE
+# Optional: override base URL for self-hosted or compatible APIs (e.g., OpenAI-compatible gateways)
+base_url: https://api.openai.com/v1
+# Optional: additional HTTP headers to send with each request
+default_headers:
+  # openai-organization: YOUR_ORG_ID
+  # Authorization is derived from api_key by the client; do not duplicate here
+  # Custom-Header: value
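How the CLI consumes a `--llm-config` file is not shown in this commit; the following is a minimal sketch of turning such a YAML file into OpenAI client keyword arguments. The function name and the whitelist of keys are illustrative assumptions, not the package's actual API.

```python
import yaml  # PyYAML, declared as a dependency in python/pyproject.toml


def load_llm_client_kwargs(path: str) -> dict:
    """Hypothetical helper: read an llm_config.yaml-style file and keep
    only the client options shown in the example template above."""
    with open(path, "r", encoding="utf-8") as handle:
        payload = yaml.safe_load(handle) or {}
    # Only pass through keys the example file documents; drop empty values.
    allowed = {"api_key", "base_url", "default_headers"}
    return {k: v for k, v in payload.items() if k in allowed and v is not None}


# The resulting mapping could then be splatted into the SDK client,
# e.g. OpenAI(**kwargs), once the optional `llm` extra is installed.
```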

python/pyproject.toml

Lines changed: 9 additions & 1 deletion
@@ -1,7 +1,13 @@
 [project]
 name = "lance-graph"
 dynamic = ["version"]
-dependencies = ["pyarrow>=14"]
+dependencies = [
+    "pyarrow>=14",
+    "pyyaml>=6.0",
+    "fastapi>=0.104.0",
+    "uvicorn>=0.24.0",
+    "pydantic>=2.0.0",
+]
 description = "Python bindings for the lance-graph Cypher engine"
 authors = [{ name = "Lance Devs", email = "dev@lancedb.com" }]
 license = { file = "LICENSE" }

@@ -37,6 +43,8 @@ build-backend = "maturin"
 [project.optional-dependencies]
 tests = ["pytest", "pyarrow>=14", "pandas"]
 dev = ["ruff", "pyright"]
+llm = ["openai>=1.52.0"]
+lance-storage = ["lance>=0.17.0"]

 [project.scripts]
 knowledge_graph = "knowledge_graph.main:main"
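The `[project.scripts]` entry points the `knowledge_graph` command at `knowledge_graph.main:main`, whose body is not part of this commit. As a hypothetical sketch only, the flag surface documented in the README could be parsed with stdlib argparse like this (flag names come from the README; the parser structure and defaults are assumptions):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the CLI flags the README documents."""
    parser = argparse.ArgumentParser(prog="knowledge_graph")
    # Positional Cypher query, e.g. "MATCH (n) RETURN n LIMIT 5"
    parser.add_argument("query", nargs="?", help="Cypher query to execute")
    parser.add_argument("--init", action="store_true", help="initialize storage and schema stub")
    parser.add_argument("--list-datasets", action="store_true", help="list Lance datasets on disk")
    parser.add_argument("--extract-preview", metavar="TEXT_OR_FILE")
    parser.add_argument("--extract-and-add", metavar="TEXT_OR_FILE")
    parser.add_argument("--ask", metavar="QUESTION")
    parser.add_argument("--extractor", choices=["llm", "heuristic"], default="llm")
    parser.add_argument("--llm-model")
    parser.add_argument("--llm-temperature", type=float)
    parser.add_argument("--llm-config", metavar="YAML_PATH")
    parser.add_argument("--log-level", default="INFO")
    parser.add_argument("--root", help="dataset root (README default: ./knowledge_graph_data)")
    parser.add_argument("--schema", help="schema YAML path")
    return parser


args = build_parser().parse_args(["--extractor", "heuristic", "--extract-preview", "notes.txt"])
print(args.extractor)  # -> heuristic
```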

python/python/knowledge_graph/__init__.py

Lines changed: 30 additions & 1 deletion
@@ -13,6 +13,19 @@
 except ImportError:  # pragma: no cover - builder is available in normal installs.
     GraphConfigBuilder = object  # type: ignore[assignment]

+from .component import KnowledgeGraphComponent
+from .config import KnowledgeGraphConfig, build_graph_config_from_mapping
+from .extraction import (
+    DEFAULT_STRATEGY,
+    BaseExtractor,
+    get_extractor,
+    preview_extraction,
+)
+from .extractors import HeuristicExtractor, LLMExtractor
+from .service import LanceKnowledgeGraph, create_default_service
+from .store import LanceGraphStore
+from .webservice import create_app
+
 TableMapping = Mapping[str, pa.Table]


@@ -100,4 +113,20 @@ def build(self) -> KnowledgeGraph:
         return KnowledgeGraph(config, self._datasets)


-__all__ = ["KnowledgeGraph", "KnowledgeGraphBuilder"]
+__all__ = [
+    "KnowledgeGraph",
+    "KnowledgeGraphBuilder",
+    "KnowledgeGraphConfig",
+    "build_graph_config_from_mapping",
+    "LanceGraphStore",
+    "LanceKnowledgeGraph",
+    "create_default_service",
+    "KnowledgeGraphComponent",
+    "create_app",
+    "DEFAULT_STRATEGY",
+    "BaseExtractor",
+    "get_extractor",
+    "preview_extraction",
+    "HeuristicExtractor",
+    "LLMExtractor",
+]
python/python/knowledge_graph/component.py (new file; path inferred from the `from .component import KnowledgeGraphComponent` line in `__init__.py`)

Lines changed: 111 additions & 0 deletions

@@ -0,0 +1,111 @@
+"""Reusable FastAPI component for the Lance knowledge graph service."""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List, Optional
+
+import pyarrow as pa
+import yaml
+from fastapi import APIRouter, HTTPException
+from pydantic import BaseModel
+
+from .config import KnowledgeGraphConfig
+from .service import LanceKnowledgeGraph
+from .store import LanceGraphStore
+
+
+class QueryRequest(BaseModel):
+    query: str
+
+
+class QueryResponse(BaseModel):
+    rows: List[Dict[str, Any]]
+    row_count: int
+
+
+class DatasetUpsertRequest(BaseModel):
+    records: List[Dict[str, Any]]
+    merge: bool = True
+
+
+class KnowledgeGraphComponent:
+    """Bundle FastAPI routes that expose the Lance knowledge graph."""
+
+    def __init__(self, config: Optional[KnowledgeGraphConfig] = None):
+        self._config = config or KnowledgeGraphConfig.default()
+        self._service: Optional[LanceKnowledgeGraph] = None
+        self.router = APIRouter(tags=["knowledge-graph"])
+        self._setup_routes()
+
+    def _get_service(self) -> LanceKnowledgeGraph:
+        if self._service is None:
+            try:
+                self._service = _create_service(self._config)
+            except FileNotFoundError as exc:
+                raise HTTPException(status_code=500, detail=str(exc)) from exc
+        return self._service
+
+    def _setup_routes(self) -> None:
+        @self.router.get("/health")
+        async def health() -> Dict[str, str]:
+            return {"status": "healthy", "service": "lance-knowledge-graph"}
+
+        @self.router.get("/datasets")
+        async def list_datasets() -> Dict[str, List[str]]:
+            service = self._get_service()
+            names = list(service.dataset_names())
+            return {"datasets": names}
+
+        @self.router.get("/datasets/{name}")
+        async def get_dataset(name: str, limit: int = 100) -> Dict[str, Any]:
+            service = self._get_service()
+            if not service.has_dataset(name):
+                raise HTTPException(
+                    status_code=404, detail=f"Dataset '{name}' not found"
+                )
+
+            table = service.load_table(name)
+            rows = table.to_pylist()
+            if limit is not None:
+                rows = rows[:limit]
+            return {"name": name, "row_count": len(rows), "rows": rows}
+
+        @self.router.post("/datasets/{name}")
+        async def upsert_dataset(
+            name: str, request: DatasetUpsertRequest
+        ) -> Dict[str, Any]:
+            if not request.records:
+                raise HTTPException(status_code=400, detail="records cannot be empty")
+
+            table = pa.Table.from_pylist(request.records)
+            service = self._get_service()
+            service.upsert_table(name, table, merge=request.merge)
+            return {"status": "ok", "dataset": name, "row_count": table.num_rows}
+
+        @self.router.post("/query", response_model=QueryResponse)
+        async def execute_query(request: QueryRequest) -> QueryResponse:
+            service = self._get_service()
+            result = service.query(request.query)
+            rows = result.to_pylist()
+            return QueryResponse(rows=rows, row_count=len(rows))
+
+        @self.router.get("/schema")
+        async def get_schema() -> Dict[str, Any]:
+            schema_path = self._config.resolved_schema_path()
+            if not schema_path.exists():
+                raise HTTPException(status_code=404, detail="Schema file not found")
+            with schema_path.open("r", encoding="utf-8") as handle:
+                payload = yaml.safe_load(handle) or {}
+            return {"path": str(schema_path), "schema": payload}
+
+    def close(self) -> None:
+        """Release retained resources."""
+        self._service = None
+
+
+def _create_service(config: KnowledgeGraphConfig) -> LanceKnowledgeGraph:
+    graph_config = config.load_graph_config()
+    storage = LanceGraphStore(config)
+    service = LanceKnowledgeGraph(graph_config, storage=storage)
+    service.ensure_initialized()
+    return service
