Commit 11fb34e

Add AI Skill for coding agents (Cursor, Claude Code, etc.)
Add an AI skill that teaches coding agents chdb's multi-data-source analytics API. Includes a one-line install script that auto-detects Cursor and Claude Code environments.
1 parent b5ee73a commit 11fb34e

File tree

5 files changed: +941 -0 lines changed

.cursor/skills/using-chdb/SKILL.md

Lines changed: 240 additions & 0 deletions

---
name: using-chdb
description: Guide for using chdb, an in-process SQL OLAP engine powered by ClickHouse. Specializes in multi-data-source analytics — query and join data from local files, S3, MySQL, PostgreSQL, MongoDB, ClickHouse, HDFS, Azure, GCS, Iceberg, Delta Lake, Hudi and more, using pandas-compatible syntax or raw SQL. Use when the user wants to query data, analyze files, join multiple data sources, work with Parquet/CSV/JSON, or build data pipelines with chdb or DataStore.
---

# Using chdb

chdb is an in-process SQL OLAP engine powered by ClickHouse. No server needed — it runs as a Python library. Its core strength is **unified multi-data-source analytics**: query and join data across local files, cloud storage, databases, and data lakes using familiar pandas syntax or ClickHouse SQL.
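Because the engine is embedded, running a query is just a function call. A quick smoke test using `chdb.query()`, the same call the Raw SQL section below builds on:

```python
import chdb

# No server, no setup: the engine runs inside the Python process
print(chdb.query("SELECT version()", "Pretty"))
```
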
## Core Idea: Any Data Source, One API

chdb treats every data source as a queryable table. You can join a local CSV with a PostgreSQL table and an S3 Parquet file in a single query — no ETL, no data movement.

```python
from datastore import DataStore

# Three different sources
logs = DataStore.from_file("app_logs.parquet")
users = DataStore.from_mysql(host="db.example.com:3306", database="prod", table="users", user="reader", password="pass")
events = DataStore.from_s3("s3://analytics-bucket/events/*.parquet", nosign=True)

# Join them with pandas-like syntax
result = (logs
    .join(users, left_on="user_id", right_on="id")
    .join(events, on="session_id")
    .groupby("country")
    .agg({"session_id": "count", "duration": "mean"})
    .sort_values("count", ascending=False)
)
print(result)  # execution happens here — fully lazy until needed
```

## Supported Data Sources

| Source | Factory Method | URI Scheme |
|--------|---------------|------------|
| **Local files** (CSV, Parquet, JSON, Arrow, ORC, Avro, TSV, XML) | `DataStore.from_file(path)` | `file:///path` or just path |
| **S3 / S3-compatible** | `DataStore.from_s3(url)` | `s3://bucket/key` |
| **Google Cloud Storage** | `DataStore.from_gcs(url)` | `gs://bucket/path` |
| **Azure Blob Storage** | `DataStore.from_azure(conn_str, container)` | `az://container/blob` |
| **HDFS** | `DataStore.from_hdfs(uri)` | `hdfs://namenode:port/path` |
| **HTTP/HTTPS** | `DataStore.from_url(url)` | `https://example.com/data.csv` |
| **MySQL** | `DataStore.from_mysql(host, database, table, user, password)` | `mysql://user:pass@host/db/table` |
| **PostgreSQL** | `DataStore.from_postgresql(host, database, table, user, password)` | `postgresql://user:pass@host/db/table` |
| **ClickHouse (remote)** | `DataStore.from_clickhouse(host, database, table)` | `clickhouse://host/db/table` |
| **MongoDB** | `DataStore.from_mongodb(host, database, collection, user, password)` | `mongodb://user:pass@host/db.collection` |
| **SQLite** | `DataStore.from_sqlite(database_path, table)` | `sqlite:///path?table=name` |
| **Redis** | `DataStore.from_redis(host, key, structure)` | `redis://host/db?key=mykey` |
| **Apache Iceberg** | `DataStore.from_iceberg(url)` | `iceberg://catalog/ns/table` |
| **Delta Lake** | `DataStore.from_delta(url)` | `deltalake:///path/to/table` |
| **Apache Hudi** | `DataStore.from_hudi(url)` | `hudi:///path/to/table` |

All sources can also be created via the universal `DataStore.uri()` method:

```python
ds = DataStore.uri("s3://my-bucket/data.parquet?nosign=true")
ds = DataStore.uri("mysql://root:pass@localhost:3306/mydb/users")
ds = DataStore.uri("postgresql://postgres:pass@host:5432/analytics/events")
ds = DataStore.uri("deltalake:///data/warehouse/events")
```

## DataStore: Pandas-Compatible Multi-Source API

DataStore provides pandas-compatible syntax that compiles to optimized ClickHouse SQL under the hood. All operations are **lazy** — execution only triggers when results are actually needed (print, len, iteration, etc.).
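As a small illustration of the lazy model, the sketch below chains documented DataStore methods and assumes `to_sql()` (see Inspection) can be called on a composed pipeline; the sample data is illustrative:

```python
from datastore import DataStore

ds = DataStore({'name': ['Alice', 'Bob'], 'age': [25, 30], 'city': ['NYC', 'LA']})

# Building the pipeline executes nothing yet
pipeline = ds[ds['age'] > 20].groupby('city').agg({'age': 'mean'})

print(pipeline.to_sql())  # peek at the generated SQL; still no execution
print(pipeline)           # printing triggers execution and materializes the result
```
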
### Create

```python
from datastore import DataStore

# From dict / DataFrame (in-memory)
ds = DataStore({'name': ['Alice', 'Bob'], 'age': [25, 30], 'city': ['NYC', 'LA']})

# From files (auto-detect format by extension)
ds = DataStore.from_file("sales.parquet")
ds = DataStore.from_file("logs/*.csv")  # glob patterns supported

# From databases
ds = DataStore.from_mysql(host="localhost:3306", database="shop", table="orders", user="root", password="pass")
ds = DataStore.from_postgresql(host="pg.example.com:5432", database="analytics", table="events", user="analyst", password="pass")

# From cloud storage
ds = DataStore.from_s3("s3://bucket/path/to/data.parquet", access_key_id="KEY", secret_access_key="SECRET")
ds = DataStore.from_s3("s3://public-bucket/data.parquet", nosign=True)

# From data lakes
ds = DataStore.from_iceberg("s3://bucket/iceberg/table", access_key_id="KEY", secret_access_key="SECRET")
ds = DataStore.from_delta("s3://bucket/delta/table", access_key_id="KEY", secret_access_key="SECRET")
```

### Filter, Select, Sort

```python
# Pandas-style
result = ds[ds['age'] > 25]
result = ds[['name', 'city']]
result = ds.sort_values('age', ascending=False)

# SQL-style fluent API
result = ds.select("name", "city").filter(ds['age'] > 25).sort("name").limit(10)
```

### GroupBy & Aggregation

```python
ds.groupby('city')['salary'].mean()
ds.groupby('department').agg({'salary': 'sum', 'name': 'count'})
ds.groupby(['region', 'product']).agg({'revenue': ['sum', 'mean'], 'quantity': 'sum'})
```

### Join Across Sources

```python
# Local Parquet + MySQL + S3 — all in one pipeline
local = DataStore.from_file("products.parquet")
db = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
cloud = DataStore.from_s3("s3://analytics/customers.parquet", nosign=True)

result = (db
    .join(local, left_on="product_id", right_on="id")
    .join(cloud, left_on="customer_id", right_on="id")
    .groupby("category")
    .agg({"amount": "sum", "order_id": "count"})
    .sort_values("sum", ascending=False)
)
print(result)
```

### Mutation & Transformation

```python
ds.assign(bonus=ds['salary'] * 0.1)
ds.with_column("full_name", ds['first'] + ' ' + ds['last'])
ds.drop(columns=['temp_col'])
ds.rename(columns={'old': 'new'})
ds.fillna(0)
ds.distinct()
```

### Inspection

```python
ds.columns     # column names (triggers execution)
ds.shape       # (rows, cols)
ds.head(5)     # first 5 rows
ds.describe()  # summary statistics
ds.to_sql()    # view the generated SQL
ds.explain()   # execution plan
```

## Raw SQL: Query Any Source Directly

```python
import chdb

# Local files
chdb.query("SELECT * FROM file('data.parquet', Parquet) WHERE price > 100 LIMIT 10")

# S3
chdb.query("SELECT count() FROM s3('s3://bucket/logs/*.parquet', 'KEY', 'SECRET', 'Parquet')")

# MySQL
chdb.query("SELECT * FROM mysql('host:3306', 'mydb', 'users', 'root', 'pass') WHERE active = 1")

# PostgreSQL
chdb.query("SELECT * FROM postgresql('host:5432', 'db', 'events', 'user', 'pass') ORDER BY ts DESC LIMIT 100")

# Cross-source SQL join
chdb.query("""
    SELECT u.name, o.amount, o.product
    FROM mysql('db:3306', 'shop', 'users', 'root', 'pass') AS u
    JOIN file('orders.parquet', Parquet) AS o ON u.id = o.user_id
    WHERE o.amount > 100
    ORDER BY o.amount DESC
""")

# Data lake formats
chdb.query("SELECT * FROM iceberg('s3://bucket/iceberg/table', 'KEY', 'SECRET') LIMIT 10")
chdb.query("SELECT * FROM deltaLake('s3://bucket/delta/table', 'KEY', 'SECRET') LIMIT 10")

# URL
chdb.query("SELECT * FROM url('https://example.com/api/data.json', JSONEachRow) LIMIT 5")

# Python dict / DataFrame as a table
data = {"name": ["Alice", "Bob"], "score": [95, 87]}
chdb.query("SELECT * FROM Python(data) ORDER BY score DESC")
```

## Session: Stateful Analysis

```python
from chdb import session as chs

sess = chs.Session()                 # in-memory
sess = chs.Session("./my_database")  # persistent

# Create tables, insert, query — state persists
sess.query("CREATE TABLE events (ts DateTime, type String, user_id UInt32) ENGINE = MergeTree() ORDER BY ts")
sess.query("INSERT INTO events VALUES (now(), 'click', 1001), (now(), 'view', 1002)")

# Combine local tables with external sources
sess.query("""
    SELECT e.type, u.name, count() AS cnt
    FROM events e
    JOIN mysql('db:3306', 'prod', 'users', 'root', 'pass') AS u ON e.user_id = u.id
    GROUP BY e.type, u.name
    ORDER BY cnt DESC
""", "Pretty").show()

sess.close()
```

## File Format Auto-Detection

| Extension | Format |
|-----------|--------|
| .csv | CSVWithNames |
| .tsv | TSVWithNames |
| .parquet, .pq | Parquet |
| .json | JSON |
| .jsonl, .ndjson | JSONEachRow |
| .arrow | Arrow |
| .orc | ORC |
| .avro | Avro |
| .xml | XML |

Glob patterns supported: `DataStore.from_file("logs/2024-*.parquet")`
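Auto-detection keys off the extension, so a file whose extension is not listed needs the format named explicitly. A sketch using the raw SQL `file()` function from the Raw SQL section; the `data.txt` filename is illustrative:

```python
import chdb

# .txt is not in the auto-detection table, so name the format explicitly
print(chdb.query("SELECT count() FROM file('data.txt', CSVWithNames)"))
```
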
## Installation

```bash
pip install chdb
```

Requires Python 3.9+ on macOS or Linux (x86_64, ARM64).
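To verify the install, any trivial query will do:

```python
import chdb

print(chdb.query("SELECT 1"))  # prints 1 if chdb is working
```
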
## Additional Resources

- For the complete API reference, see [reference.md](reference.md)
- For more usage examples, see [examples.md](examples.md)
