chdb-io
diff --git a/.cursor/skills/using-chdb/SKILL.md b/.cursor/skills/using-chdb/SKILL.md
@@ -0,0 +1,240 @@
+---
+name: using-chdb
+description: Guide for using chdb, an in-process SQL OLAP engine powered by ClickHouse. Specializes in multi-data-source analytics — query and join data from local files, S3, MySQL, PostgreSQL, MongoDB, ClickHouse, HDFS, Azure, GCS, Iceberg, Delta Lake, Hudi and more, using pandas-compatible syntax or raw SQL. Use when the user wants to query data, analyze files, join multiple data sources, work with Parquet/CSV/JSON, or build data pipelines with chdb or DataStore.
+---
+
+# Using chdb
+
+chdb is an in-process SQL OLAP engine powered by ClickHouse. No server needed — it runs as a Python library. Its core strength is **unified multi-data-source analytics**: query and join data across local files, cloud storage, databases, and data lakes using familiar pandas syntax or ClickHouse SQL.
+
+## Core Idea: Any Data Source, One API
+
+chdb treats every data source as a queryable table. You can join a local CSV with a PostgreSQL table and an S3 Parquet file in a single query — no ETL, no data movement.
+
+```python
+from datastore import DataStore
+
+# Three different sources
+logs = DataStore.from_file("app_logs.parquet")
+users = DataStore.from_mysql(host="db.example.com:3306", database="prod", table="users", user="reader", password="pass")
+events = DataStore.from_s3("s3://analytics-bucket/events/*.parquet", nosign=True)
+
+# Join them with pandas-like syntax
+result = (logs
+    .join(users, left_on="user_id", right_on="id")
+    .join(events, on="session_id")
+    .groupby("country")
+    .agg({"session_id": "count", "duration": "mean"})
+    .sort_values("session_id", ascending=False)  # sort by the per-country session count
+)
+print(result)  # execution happens here — fully lazy until needed
+```
+
+## Supported Data Sources
+
+| Source | Factory Method | URI Scheme |
+|--------|---------------|------------|
+| **Local files** (CSV, Parquet, JSON, Arrow, ORC, Avro, TSV, XML) | `DataStore.from_file(path)` | `file:///path` or just path |
+| **S3 / S3-compatible** | `DataStore.from_s3(url)` | `s3://bucket/key` |
+| **Google Cloud Storage** | `DataStore.from_gcs(url)` | `gs://bucket/path` |
+| **Azure Blob Storage** | `DataStore.from_azure(conn_str, container)` | `az://container/blob` |
+| **HDFS** | `DataStore.from_hdfs(uri)` | `hdfs://namenode:port/path` |
+| **HTTP/HTTPS** | `DataStore.from_url(url)` | `https://example.com/data.csv` |
+| **MySQL** | `DataStore.from_mysql(host, database, table, user, password)` | `mysql://user:pass@host/db/table` |
+| **PostgreSQL** | `DataStore.from_postgresql(host, database, table, user, password)` | `postgresql://user:pass@host/db/table` |
+| **ClickHouse (remote)** | `DataStore.from_clickhouse(host, database, table)` | `clickhouse://host/db/table` |
+| **MongoDB** | `DataStore.from_mongodb(host, database, collection, user, password)` | `mongodb://user:pass@host/db.collection` |
+| **SQLite** | `DataStore.from_sqlite(database_path, table)` | `sqlite:///path?table=name` |
+| **Redis** | `DataStore.from_redis(host, key, structure)` | `redis://host/db?key=mykey` |
+| **Apache Iceberg** | `DataStore.from_iceberg(url)` | `iceberg://catalog/ns/table` |
+| **Delta Lake** | `DataStore.from_delta(url)` | `deltalake:///path/to/table` |
+| **Apache Hudi** | `DataStore.from_hudi(url)` | `hudi:///path/to/table` |
+
+All sources can also be created via the universal `DataStore.uri()` method:
+
+```python
+ds = DataStore.uri("s3://my-bucket/data.parquet?nosign=true")
+ds = DataStore.uri("mysql://root:pass@localhost:3306/mydb/users")
+ds = DataStore.uri("postgresql://postgres:pass@host:5432/analytics/events")
+ds = DataStore.uri("deltalake:///data/warehouse/events")
+```
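To make the layout of these URIs concrete, here is a minimal sketch that splits them with Python's standard `urllib.parse`. It only illustrates where the scheme, credentials, host, path segments, and query options sit. `describe_uri` is a hypothetical helper introduced for this example, and nothing here is claimed to match how `DataStore.uri()` parses its input.

```python
from urllib.parse import urlsplit, parse_qs


def describe_uri(uri: str) -> dict:
    """Split a data-source URI into its visible components (illustrative only)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,  # None when the URI has no authority part
        "port": parts.port,
        # Path segments, e.g. ["mydb", "users"] for /mydb/users
        "path": [seg for seg in parts.path.split("/") if seg],
        # Query options, keeping the first value given for each key
        "options": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }


for example in (
    "s3://my-bucket/data.parquet?nosign=true",
    "mysql://root:pass@localhost:3306/mydb/users",
    "deltalake:///data/warehouse/events",
):
    print(describe_uri(example))
```

For the database schemes the path carries the database and table, while for the storage schemes it carries the object key or directory.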
+
+## DataStore: Pandas-Compatible Multi-Source API
+
+DataStore provides pandas-compatible syntax that compiles to optimized ClickHouse SQL under the hood. All operations are **lazy**: execution is triggered only when results are actually needed (for example by `print`, `len`, or iteration).
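The idea of lazy execution can be shown with a toy class in plain Python. This is a sketch of the general pattern only (record steps, run them when a result is requested). `LazyQuery` and its `executions` counter are hypothetical names for this illustration and say nothing about DataStore's internals.

```python
class LazyQuery:
    """Toy illustration of lazy evaluation: steps are recorded, not run."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)
        self.executions = 0  # how many times this pipeline has actually run

    def filter(self, predicate):
        # Return a new object with one more recorded step; nothing runs yet.
        return LazyQuery(self._rows, self._steps + (("filter", predicate),))

    def limit(self, n):
        return LazyQuery(self._rows, self._steps + (("limit", n),))

    def _execute(self):
        self.executions += 1
        rows = list(self._rows)
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            elif kind == "limit":
                rows = rows[:arg]
        return rows

    def __len__(self):
        return len(self._execute())

    def __repr__(self):
        return repr(self._execute())


people = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
q = LazyQuery(people).filter(lambda r: r["age"] > 25).limit(10)
# Building the pipeline did no work; printing is what triggers execution.
print(q)
```

Chaining `filter` and `limit` only grows the list of recorded steps, which is what lets a real engine see the whole pipeline before deciding how to run it.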
+
+### Create
+
+```python
+from datastore import DataStore
+
+# From dict / DataFrame (in-memory)
+ds = DataStore({'name': ['Alice', 'Bob'], 'age': [25, 30], 'city': ['NYC', 'LA']})
+
+# From files (auto-detect format by extension)
+ds = DataStore.from_file("sales.parquet")
+ds = DataStore.from_file("logs/*.csv")  # glob patterns supported
+
+# From databases
+ds = DataStore.from_mysql(host="localhost:3306", database="shop", table="orders", user="root", password="pass")
+ds = DataStore.from_postgresql(host="pg.example.com:5432", database="analytics", table="events", user="analyst", password="pass")
+
+# From cloud storage
+ds = DataStore.from_s3("s3://bucket/path/to/data.parquet", access_key_id="KEY", secret_access_key="SECRET")
+ds = DataStore.from_s3("s3://public-bucket/data.parquet", nosign=True)
+
+# From data lakes
+ds = DataStore.from_iceberg("s3://bucket/iceberg/table", access_key_id="KEY", secret_access_key="SECRET")
+ds = DataStore.from_delta("s3://bucket/delta/table", access_key_id="KEY", secret_access_key="SECRET")
+```
+
+### Filter, Select, Sort
+
+```python
+# Pandas-style
+result = ds[ds['age'] > 25]
+result = ds[['name', 'city']]
+result = ds.sort_values('age', ascending=False)
+
+# SQL-style fluent API
+result = ds.select("name", "city").filter(ds['age'] > 25).sort("name").limit(10)
+```
+
+### GroupBy & Aggregation
+
+```python
+ds.groupby('city')['salary'].mean()
+ds.groupby('department').agg({'salary': 'sum', 'name': 'count'})
+ds.groupby(['region', 'product']).agg({'revenue': ['sum', 'mean'], 'quantity': 'sum'})
+```
+
+### Join Across Sources
+
+```python
+# Local Parquet + MySQL + S3 — all in one pipeline
+local = DataStore.from_file("products.parquet")
+db = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
+cloud = DataStore.from_s3("s3://analytics/customers.parquet", nosign=True)
+
+result = (db
+    .join(local, left_on="product_id", right_on="id")
+    .join(cloud, left_on="customer_id", right_on="id")
+    .groupby("category")
+    .agg({"amount": "sum", "order_id": "count"})
+    .sort_values("amount", ascending=False)  # sort by the summed amount per category
+)
+print(result)
+```
+
+### Mutation & Transformation
+
+```python
+ds.assign(bonus=ds['salary'] * 0.1)
+ds.with_column("full_name", ds['first'] + ' ' + ds['last'])
+ds.drop(columns=['temp_col'])
+ds.rename(columns={'old': 'new'})
+ds.fillna(0)
+ds.distinct()
+```
+
+### Inspection
+
+```python
+ds.columns       # column names (triggers execution)
+ds.shape         # (rows, cols)
+ds.head(5)       # first 5 rows
+ds.describe()    # statistics
+ds.to_sql()      # view generated SQL
+ds.explain()     # execution plan
+```
+
+## Raw SQL: Query Any Source Directly
+
+```python
+import chdb
+
+# Local files
+chdb.query("SELECT * FROM file('data.parquet', Parquet) WHERE price > 100 LIMIT 10")
+
+# S3
+chdb.query("SELECT count() FROM s3('s3://bucket/logs/*.parquet', 'KEY', 'SECRET', 'Parquet')")
+
+# MySQL
+chdb.query("SELECT * FROM mysql('host:3306', 'mydb', 'users', 'root', 'pass') WHERE active = 1")
+
+# PostgreSQL
+chdb.query("SELECT * FROM postgresql('host:5432', 'db', 'events', 'user', 'pass') ORDER BY ts DESC LIMIT 100")
+
+# Cross-source SQL join
+chdb.query("""
+    SELECT u.name, o.amount, o.product
+    FROM mysql('db:3306', 'shop', 'users', 'root', 'pass') AS u
+    JOIN file('orders.parquet', Parquet) AS o ON u.id = o.user_id
+    WHERE o.amount > 100
+    ORDER BY o.amount DESC
+""")
+
+# Data lake formats
+chdb.query("SELECT * FROM iceberg('s3://bucket/iceberg/table', 'KEY', 'SECRET') LIMIT 10")
+chdb.query("SELECT * FROM deltaLake('s3://bucket/delta/table', 'KEY', 'SECRET') LIMIT 10")
+
+# URL
+chdb.query("SELECT * FROM url('https://example.com/api/data.json', JSONEachRow) LIMIT 5")
+
+# Python dict / DataFrame as table
+data = {"name": ["Alice", "Bob"], "score": [95, 87]}
+chdb.query("SELECT * FROM Python(data) ORDER BY score DESC")
+```
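The examples above embed credentials directly in the SQL text. One possible way to avoid that is to assemble the table-function call from environment variables. The sketch below uses only the standard library. `sql_literal` and `mysql_source` are hypothetical helpers, `MYSQL_USER` and `MYSQL_PASSWORD` are assumed variable names, and the escaping relies on ClickHouse accepting backslash escapes inside single-quoted string literals.

```python
import os


def sql_literal(value: str) -> str:
    """Quote a Python string as a single-quoted SQL string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def mysql_source(host: str, database: str, table: str, user: str, password: str) -> str:
    """Build the text of a mysql(...) table-function call."""
    args = ", ".join(sql_literal(a) for a in (host, database, table, user, password))
    return f"mysql({args})"


# Hypothetical environment variable names; the fallbacks mirror the example values.
source = mysql_source(
    "host:3306",
    "mydb",
    "users",
    os.environ.get("MYSQL_USER", "root"),
    os.environ.get("MYSQL_PASSWORD", "pass"),
)
query = f"SELECT * FROM {source} WHERE active = 1"
# The resulting string can then be passed to chdb.query(query).
```

Avoid printing or logging `query`, since it contains the password in clear text.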
+
+## Session: Stateful Analysis
+
+```python
+from chdb import session as chs
+
+sess = chs.Session()                      # in-memory session (data is discarded on close)
+# sess = chs.Session("./my_database")     # alternative: persistent session stored at this path
+
+# Create tables, insert, query — state persists
+sess.query("CREATE TABLE events (ts DateTime, type String, user_id UInt32) ENGINE = MergeTree() ORDER BY ts")
+sess.query("INSERT INTO events VALUES (now(), 'click', 1001), (now(), 'view', 1002)")
+
+# Combine local tables with external sources
+sess.query("""
+    SELECT e.type, u.name, count() AS cnt
+    FROM events e
+    JOIN mysql('db:3306', 'prod', 'users', 'root', 'pass') AS u ON e.user_id = u.id
+    GROUP BY e.type, u.name
+    ORDER BY cnt DESC
+""", "Pretty").show()
+
+sess.close()
+```
+
+## File Format Auto-Detection
+
+| Extension | Format |
+|-----------|--------|
+| .csv | CSVWithNames |
+| .tsv | TSVWithNames |
+| .parquet, .pq | Parquet |
+| .json | JSON |
+| .jsonl, .ndjson | JSONEachRow |
+| .arrow | Arrow |
+| .orc | ORC |
+| .avro | Avro |
+| .xml | XML |
+
+Glob patterns are supported, for example `DataStore.from_file("logs/2024-*.parquet")`.
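The table above amounts to a lookup from file extension to format name. The following sketch reproduces that lookup in plain Python. `guess_format` is a hypothetical helper written for illustration, and chdb's actual detection logic may differ (for example in how it treats compressed files).

```python
from pathlib import PurePosixPath

# Extension-to-format pairs copied from the table above.
EXTENSION_FORMATS = {
    ".csv": "CSVWithNames",
    ".tsv": "TSVWithNames",
    ".parquet": "Parquet",
    ".pq": "Parquet",
    ".json": "JSON",
    ".jsonl": "JSONEachRow",
    ".ndjson": "JSONEachRow",
    ".arrow": "Arrow",
    ".orc": "ORC",
    ".avro": "Avro",
    ".xml": "XML",
}


def guess_format(path: str) -> str:
    """Return the format name for a path or glob pattern, based on its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"no known format for extension {suffix!r}") from None


print(guess_format("logs/2024-*.parquet"))
print(guess_format("events.ndjson"))
```

Because only the final extension is inspected, a glob such as `logs/2024-*.parquet` resolves the same way as a single file name.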
+
+## Installation
+
+```bash
+pip install chdb
+```
+
+Requires Python 3.9+ on macOS or Linux (x86_64 or ARM64).
+
+## Additional Resources
+
+- For complete API reference, see [reference.md](reference.md)
+- For more usage examples, see [examples.md](examples.md)