Moltbook Crawler

Crawl Moltbook posts, comments, agents, and submolts via the public API.

Dataset

https://huggingface.co/datasets/lysandrehooh/moltbook

  • Data cutoff: January 31, 2026, 23:59 UTC
  • 50,539 posts by 12,454 unique AI agents
  • 195,414 comments
  • 1,604 communities (submolts)
  • Average post length: 706 characters

Platform Overview

Moltbook is an AI Agent social network with a Reddit-like structure:

  • Posts: Content published by AI agents
  • Comments: Replies to posts, with nested (tree) structure
  • Agents: AI bot users
  • Submolts: Communities (similar to subreddits)

API Overview

Basics

  • Base URL: https://www.moltbook.com/api/v1
  • Authentication: Not required for public data
  • Rate limit: 100 requests/minute

Endpoints

  1. Posts list (paginated)

    GET /posts?sort=new&limit=50&offset=0
    

    Returns: posts, count, has_more, next_offset (pagination is sketched after this list)

  2. Post detail (includes full comment tree)

    GET /posts/{post_id}
    

    Returns: post, comments (nested, with replies)

  3. Submolts list

    GET /submolts
    

    Returns: submolts, count, total_posts, total_comments

  4. Search

    GET /search?q=keyword&type=all&limit=20
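
A minimal sketch of paginating the posts list while staying under the 100 requests/minute limit. The endpoint path and response fields come from the list above; the fixed-delay throttle and the requests-based client are assumptions, not part of the API.

import time
import requests

BASE_URL = "https://www.moltbook.com/api/v1"
MIN_INTERVAL = 60 / 100  # assumption: simple fixed delay to stay under 100 requests/minute

def iter_posts(limit=50):
    """Yield posts from GET /posts, following next_offset until has_more is false."""
    offset = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/posts",
            params={"sort": "new", "limit": limit, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        yield from data["posts"]
        if not data.get("has_more"):
            break
        offset = data["next_offset"]
        time.sleep(MIN_INTERVAL)

# Example usage: print the first 50 titles
for i, post in enumerate(iter_posts()):
    print(post["title"])
    if i == 49:
        break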
    

Raw API Shapes

Post (raw API)

{
  "id": "uuid",
  "title": "string",
  "content": "string",
  "url": "string | null",
  "upvotes": 0,
  "downvotes": 0,
  "comment_count": 0,
  "created_at": "ISO8601",
  "author": { "id", "name", "karma", "follower_count" },
  "submolt": { "id", "name", "display_name" }
}

Comment (raw API, nested)

{
  "id": "uuid",
  "content": "string",
  "parent_id": "uuid | null",
  "upvotes": 0,
  "downvotes": 0,
  "created_at": "ISO8601",
  "author": { "id", "name", "karma", "follower_count" },
  "replies": [/* nested comments */]
}

Crawling Strategy

  1. Two phases

    • Phase 1: Paginate through the posts list to get post IDs
    • Phase 2: For each post, fetch detail (including full comment tree)
  2. Why not a separate comments endpoint

    • There is no standalone “all comments” endpoint
    • Comments are returned inside post detail
    • Each post returns its full comment tree in one response
  3. Processing

    • Flatten comment trees for storage and analysis (a sketch follows this list)
    • Add fields: score, created_utc, depth, is_submitter, etc.
    • Extract and deduplicate agents and submolts
    • See “Data formats” and DATA_SCHEMA.md for schemas
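
The flattening step can be a recursive walk over the nested replies array from the raw comment shape above. The score, depth, and is_submitter derivations shown here are assumptions about how those fields are computed; see DATA_SCHEMA.md for the authoritative schema.

def flatten_comments(comments, post_id, post_author_id, parent_id=None, depth=0):
    """Walk the nested `replies` arrays and emit one flat record per comment."""
    flat = []
    for c in comments:
        flat.append({
            "id": c["id"],
            "post_id": post_id,
            "parent_id": parent_id,
            "content": c["content"],
            "upvotes": c["upvotes"],
            "downvotes": c["downvotes"],
            "score": c["upvotes"] - c["downvotes"],        # assumption: score = upvotes - downvotes
            "depth": depth,
            "is_submitter": c["author"]["id"] == post_author_id,
            "created_at": c["created_at"],
            "author_id": c["author"]["id"],
            "author_name": c["author"]["name"],
        })
        # Recurse into nested replies, one level deeper
        flat.extend(flatten_comments(c.get("replies", []), post_id, post_author_id,
                                     parent_id=c["id"], depth=depth + 1))
    return flat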

Scripts

1. moltbook_crawler.py — Full crawl

Main crawler: posts, comments, agents, and submolts (with per-submolt post counts and first-seen timestamps derived from crawled posts).

Install

pip install -r requirements.txt

Run

# Full crawl (default cutoff: 2026-01-31 23:59 UTC)
python moltbook_crawler.py

# Estimate crawl time
python moltbook_crawler.py --estimate

# Start fresh (ignore checkpoint)
python moltbook_crawler.py --no-resume

# Custom output directory
python moltbook_crawler.py --output my_data

Time filters (posts are included only if created_at falls within the range):

  • End time only (crawl posts created before this time):
    python moltbook_crawler.py --cutoff "2026-01-31T23:59:00+00:00"
  • No end limit (crawl all posts, no cutoff):
    python moltbook_crawler.py --no-cutoff
  • Start time only (crawl posts created after this time):
    python moltbook_crawler.py --start "2026-01-01T00:00:00+00:00"
  • Last N minutes (convenience; start = now − N min):
    python moltbook_crawler.py --last-minutes 60 --output data_test --no-resume
  • Start and end (time range: after start, before cutoff):
    python moltbook_crawler.py --start "2026-01-15T00:00:00+00:00" --cutoff "2026-01-31T23:59:00+00:00"

All times are ISO 8601 with timezone (e.g. +00:00 for UTC). Examples:

# Only posts before 2026-01-31 noon UTC
python moltbook_crawler.py --cutoff "2026-01-31T12:00:00+00:00"

# Only posts on or after 2026-01-15
python moltbook_crawler.py --start "2026-01-15T00:00:00+00:00"

# Time range: Jan 15–31, 2026 (UTC)
python moltbook_crawler.py --start "2026-01-15T00:00:00+00:00" --cutoff "2026-01-31T23:59:59+00:00" --output data_jan

# Quick test: only last 5 minutes
python moltbook_crawler.py --last-minutes 5 --output data_test --no-resume
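
Internally, these flags reduce to a range check on each post's created_at. A minimal sketch of that check, assuming standard ISO 8601 parsing; the crawler's exact boundary handling (strict vs. inclusive) is not reproduced here and may differ.

from datetime import datetime
from typing import Optional

def in_range(created_at: str, start: Optional[datetime], cutoff: Optional[datetime]) -> bool:
    """Return True if an ISO 8601 timestamp falls between start and cutoff (both optional)."""
    ts = datetime.fromisoformat(created_at)
    if start is not None and ts < start:
        return False
    if cutoff is not None and ts > cutoff:
        return False
    return True

# Example: the default cutoff used for this dataset
cutoff = datetime.fromisoformat("2026-01-31T23:59:00+00:00")
print(in_range("2026-01-31T07:17:40.280330+00:00", None, cutoff))  # True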

Output

data/
├── all_posts.jsonl      # One post per line (JSONL)
├── all_comments.jsonl   # One comment per line (flattened)
├── all_agents.jsonl     # Deduplicated agents
├── all_submolts.jsonl   # Submolts with post_count, first_seen_at
├── checkpoint.json      # Resume state
└── stats.json           # Run statistics

2. fetch_submolts.py — Submolts base info only

Purpose: Fetch only submolt metadata from the API (no post/comment crawling). Useful when you need a quick list of communities with names, descriptions, and subscriber counts, without running the full crawler.

What it does

  • Calls GET /api/v1/submolts once (a sketch follows this list)
  • Writes one JSONL file of submolt records and one summary JSON
  • No post or comment fetching; completes in seconds
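
A minimal sketch of that single-call flow, assuming a requests-based client. The output filenames match the table below, but the exact field handling in fetch_submolts.py is not reproduced here.

import json
from datetime import datetime, timezone
from pathlib import Path

import requests

BASE_URL = "https://www.moltbook.com/api/v1"

def fetch_submolts(output_dir="data"):
    """Fetch the submolt list once and write one JSONL file plus a summary JSON."""
    resp = requests.get(f"{BASE_URL}/submolts", timeout=30)
    resp.raise_for_status()
    data = resp.json()

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc).isoformat()

    # One submolt per line; subscribers is renamed from the API's subscriber_count
    with open(out / "submolts_base.jsonl", "w", encoding="utf-8") as f:
        for s in data["submolts"]:
            record = {**s, "subscribers": s.get("subscriber_count"), "crawled_at": now}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Platform-level totals returned alongside the submolt list
    summary = {
        "total_submolts": data["count"],
        "total_posts": data["total_posts"],
        "total_comments": data["total_comments"],
        "fetched_at": now,
    }
    with open(out / "submolts_stats_summary.json", "w", encoding="utf-8") as f:
        json.dump(summary, f, ensure_ascii=False, indent=2)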

Output files

  • submolts_base.jsonl: one submolt per line with id, name, display_name, description, subscribers, crawled_at, plus optional API fields (created_at, last_activity_at, etc.)
  • submolts_stats_summary.json: platform-level counts (total_submolts, total_posts, total_comments, fetched_at)

Fields in submolts_base.jsonl

  • Included: id, name, display_name, description, subscribers (from API subscriber_count), crawled_at, and when present: created_at, last_activity_at, featured_at, created_by
  • Not included: post_count, first_seen_at — those are derived from posts and are only available from moltbook_crawler.py (see all_submolts.jsonl)

Commands

fetch_submolts.py has no time filters: it calls the API once and returns the current list of submolts (a single snapshot).

# Default output directory: data/
python fetch_submolts.py

# Custom output directory
python fetch_submolts.py --output my_data

When to use

  • You only need community list + descriptions + subscriber counts
  • You want to avoid running the full crawler
  • You want a fast snapshot of “what submolts exist” and platform totals

Relation to main crawler

  • fetch_submolts.py: base info only (from /submolts); fast, no post crawl
  • moltbook_crawler.py: base info plus post_count and first_seen_at derived from crawled posts (all_submolts.jsonl)

You can run either script alone or use both (e.g. run fetch_submolts.py for a quick community list, then run moltbook_crawler.py when you need full post/comment data and per-submolt post stats).


Data formats

Post

{
  "id": "f6ea4da4-0c38-4515-8240-c1ebee510f95",
  "title": "帖子标题",
  "content": "帖子内容...",
  "url": null,
  "upvotes": 10,
  "downvotes": 1,
  "score": 9,
  "comment_count": 5,
  "created_at": "2026-01-31T07:17:40.280330+00:00",
  "created_utc": 1738311460,
  "submolt_id": "29beb7ee-ca7d-4290-9c2f-09926264866f",
  "submolt_name": "general",
  "submolt_display_name": "General",
  "author_id": "c5b72a64-9251-440b-8a00-d076d49adcbc",
  "author_name": "AgentName",
  "permalink": "https://www.moltbook.com/post/f6ea4da4-0c38-4515-8240-c1ebee510f95",
  "crawled_at": "2026-01-31T23:59:00+00:00"
}

Comment

{
  "id": "ec748644-1da6-46f4-b7c9-3caecb6cb14f",
  "post_id": "dc524639-6549-4f5c-bcc6-cee56d832539",
  "parent_id": null,
  "content": "评论内容...",
  "upvotes": 1,
  "downvotes": 0,
  "score": 1,
  "depth": 0,
  "is_submitter": false,
  "created_at": "2026-01-31T04:25:31.573770+00:00",
  "created_utc": 1738300731,
  "author_id": "7e33c519-8140-4370-b274-b4a9db16f766",
  "author_name": "eudaemon_0",
  "author_karma": 23455,
  "crawled_at": "2026-01-31T23:59:00+00:00"
}

Submolt

{
  "id": "29beb7ee-ca7d-4290-9c2f-09926264866f",
  "name": "general",
  "display_name": "General",
  "description": "A place for general discussion",
  "subscribers": 1500,
  "post_count": 320,
  "first_seen_at": "2026-01-15T10:00:00+00:00",
  "crawled_at": "2026-01-31T23:59:00+00:00"
}

Agent

{
  "id": "uuid",
  "name": "AgentName",
  "description": "Agent 描述",
  "karma": 1000,
  "follower_count": 50,
  "following_count": 10,
  "owner": {
    "x_handle": "twitter_handle",
    "x_name": "Human Name",
    "x_follower_count": 500,
    "x_verified": false
  },
  "crawled_at": "2026-01-31T23:59:00+00:00"
}

Entity relationships

┌─────────────────┐         ┌─────────────────┐         ┌─────────────────┐
│    Submolts     │ 1    n  │     Posts       │ 1    n  │    Comments     │
├─────────────────┤◄────────┼─────────────────┤◄────────┼─────────────────┤
│ id              │         │ id              │         │ id              │
│ name            │         │ submolt_id ─────┼─────────│ post_id ────────│
│ display_name    │         │ submolt_name    │         │ parent_id       │
│ post_count      │         │ title           │         │ content         │
│ first_seen_at   │         │ content         │         │ depth           │
│ crawled_at      │         │ author_id ──────┼────┐    │ author_id ──────┼────┐
└─────────────────┘         │ ...             │    │    │ ...             │    │
                            └─────────────────┘    │    └─────────────────┘    │
                                                   │                           │
                            ┌─────────────────┐    │                           │
                            │     Agents      │◄───┴───────────────────────────┘
                            ├─────────────────┤
                            │ id              │
                            │ name            │
                            │ karma           │
                            │ follower_count  │
                            │ owner_x_handle  │
                            │ ...             │
                            └─────────────────┘
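
The relationships above can be resolved in memory with lookup dicts keyed by id. A small usage sketch over the crawler's JSONL outputs (file and field names as documented above):

import json

def load_jsonl(path):
    """Read one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

posts = load_jsonl("data/all_posts.jsonl")
agents = load_jsonl("data/all_agents.jsonl")
submolts = load_jsonl("data/all_submolts.jsonl")

agents_by_id = {a["id"]: a for a in agents}
submolts_by_id = {s["id"]: s for s in submolts}

# Example join: attach author karma and submolt description to each post
for p in posts[:5]:
    author = agents_by_id.get(p["author_id"], {})
    submolt = submolts_by_id.get(p["submolt_id"], {})
    print(p["title"][:40], author.get("karma"), submolt.get("description"))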

Checkpoint and resume

  • Progress is saved every 100 posts to checkpoint.json (a sketch follows this list)
  • Restarting the script continues from the last saved offset
  • Use --no-resume to ignore the checkpoint and start from the beginning
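
A hedged sketch of how this checkpointing can work; the field names inside checkpoint.json here are assumptions, so refer to the script for the actual format.

import json
import os

CHECKPOINT_PATH = "data/checkpoint.json"

def save_checkpoint(offset, crawled_post_ids):
    """Persist the current offset and seen post IDs (called every 100 posts)."""
    with open(CHECKPOINT_PATH, "w", encoding="utf-8") as f:
        json.dump({"offset": offset, "post_ids": sorted(crawled_post_ids)}, f)

def load_checkpoint(resume=True):
    """Return (offset, post_ids); start from scratch if resume is disabled or no file exists."""
    if not resume or not os.path.exists(CHECKPOINT_PATH):
        return 0, set()
    with open(CHECKPOINT_PATH, encoding="utf-8") as f:
        state = json.load(f)
    return state["offset"], set(state["post_ids"])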

Analysis example

import json

with open('data/all_posts.jsonl') as f:
    posts = [json.loads(line) for line in f]

with open('data/all_comments.jsonl') as f:
    comments = [json.loads(line) for line in f]

with open('data/all_agents.jsonl') as f:
    agents = [json.loads(line) for line in f]

with open('data/all_submolts.jsonl') as f:
    submolts = [json.loads(line) for line in f]

# Top 10 agents by comment count
from collections import Counter
agent_comments = Counter(c['author_name'] for c in comments)
print("Top 10 agents by comments:", agent_comments.most_common(10))

# Top submolts by post count
submolt_posts = Counter(p['submolt_name'] for p in posts)
print("Top submolts:", submolt_posts.most_common(10))

# Average comment depth
avg_depth = sum(c['depth'] for c in comments) / len(comments)
print(f"Average comment depth: {avg_depth:.2f}")

# Top posts by score
top_posts = sorted(posts, key=lambda p: p['score'], reverse=True)[:10]
for post in top_posts:
    print(f"  [{post['score']:4d}] {post['title'][:50]}")

Notes

  1. Rate limit: Stay under 100 requests/minute.
  2. Use: Research and learning only.
  3. Storage: Expect on the order of 500MB–1GB depending on content size.

Possible improvements

  1. Concurrency: Add limited concurrency with careful rate limiting.
  2. Incremental crawl: Track last crawl time and only fetch new posts.
  3. Compression: Store JSONL as gzip (sketched below).
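
For the compression idea, writing through gzip could be as simple as the following sketch (not implemented in the current scripts; the .gz filenames are an assumption):

import gzip
import json

def write_jsonl_gz(records, path):
    """Write records as gzip-compressed JSONL (e.g. all_posts.jsonl.gz)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def read_jsonl_gz(path):
    """Read gzip-compressed JSONL back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]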
