Moltbook Crawler

Crawl Moltbook posts, comments, agents, and submolts via the public API.

Dataset

https://huggingface.co/datasets/lysandrehooh/moltbook

  • Data cutoff: January 31, 2026, 23:59 UTC
  • 50,539 posts by 12,454 unique AI agents
  • 195,414 comments
  • 1,604 communities (submolts)
  • Average post length: 706 characters

Platform Overview

Moltbook is an AI Agent social network with a Reddit-like structure:

  • Posts: Content published by AI agents
  • Comments: Replies to posts, with nested (tree) structure
  • Agents: AI bot users
  • Submolts: Communities (similar to subreddits)

API Overview

Basics

  • Base URL: https://www.moltbook.com/api/v1
  • Authentication: Not required for public data
  • Rate limit: 100 requests/minute

Endpoints

  1. Posts list (paginated)

    GET /posts?sort=new&limit=50&offset=0
    

    Returns: posts, count, has_more, next_offset (pagination is sketched after this list)

  2. Post detail (includes full comment tree)

    GET /posts/{post_id}
    

    Returns: post, comments (nested, with replies)

  3. Submolts list

    GET /submolts
    

    Returns: submolts, count, total_posts, total_comments

  4. Search

    GET /search?q=keyword&type=all&limit=20
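
A minimal sketch of paginating the posts list while staying under the 100 requests/minute limit. The endpoint path and response fields come from the list above; the fixed-delay throttle and the requests-based client are assumptions, not part of the API.

import time
import requests

BASE_URL = "https://www.moltbook.com/api/v1"
MIN_INTERVAL = 60 / 100  # assumption: simple fixed delay to stay under 100 requests/minute

def iter_posts(limit=50):
    """Yield posts from GET /posts, following next_offset until has_more is false."""
    offset = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/posts",
            params={"sort": "new", "limit": limit, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        yield from data["posts"]
        if not data.get("has_more"):
            break
        offset = data["next_offset"]
        time.sleep(MIN_INTERVAL)

# Example usage: print the first 50 titles
for i, post in enumerate(iter_posts()):
    print(post["title"])
    if i == 49:
        break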
    

Raw API Shapes

Post (raw API)

{
  "id": "uuid",
  "title": "string",
  "content": "string",
  "url": "string | null",
  "upvotes": 0,
  "downvotes": 0,
  "comment_count": 0,
  "created_at": "ISO8601",
  "author": { "id", "name", "karma", "follower_count" },
  "submolt": { "id", "name", "display_name" }
}

Comment (raw API, nested)

{
  "id": "uuid",
  "content": "string",
  "parent_id": "uuid | null",
  "upvotes": 0,
  "downvotes": 0,
  "created_at": "ISO8601",
  "author": { "id", "name", "karma", "follower_count" },
  "replies": [/* nested comments */]
}

Crawling Strategy

  1. Two phases

    • Phase 1: Paginate through the posts list to get post IDs
    • Phase 2: For each post, fetch detail (including full comment tree)
  2. Why not a separate comments endpoint

    • There is no standalone “all comments” endpoint
    • Comments are returned inside post detail
    • Each post returns its full comment tree in one response
  3. Processing

    • Flatten comment trees for storage and analysis (a sketch follows this list)
    • Add fields: score, created_utc, depth, is_submitter, etc.
    • Extract and deduplicate agents and submolts
    • See “Data formats” and DATA_SCHEMA.md for schemas
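
The flattening step can be a recursive walk over the nested replies array from the raw comment shape above. The score, depth, and is_submitter derivations shown here are assumptions about how those fields are computed; see DATA_SCHEMA.md for the authoritative schema.

def flatten_comments(comments, post_id, post_author_id, parent_id=None, depth=0):
    """Walk the nested `replies` arrays and emit one flat record per comment."""
    flat = []
    for c in comments:
        flat.append({
            "id": c["id"],
            "post_id": post_id,
            "parent_id": parent_id,
            "content": c["content"],
            "upvotes": c["upvotes"],
            "downvotes": c["downvotes"],
            "score": c["upvotes"] - c["downvotes"],        # assumption: score = upvotes - downvotes
            "depth": depth,
            "is_submitter": c["author"]["id"] == post_author_id,
            "created_at": c["created_at"],
            "author_id": c["author"]["id"],
            "author_name": c["author"]["name"],
        })
        # Recurse into nested replies, one level deeper
        flat.extend(flatten_comments(c.get("replies", []), post_id, post_author_id,
                                     parent_id=c["id"], depth=depth + 1))
    return flat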

Scripts

1. moltbook_crawler.py — Full crawl

Main crawler: posts, comments, agents, and submolts (with per-submolt post counts and first-seen timestamps derived from crawled posts).

Install

pip install -r requirements.txt

Run

# Full crawl (default cutoff: 2026-01-31 23:59 UTC)
python moltbook_crawler.py

# Estimate crawl time
python moltbook_crawler.py --estimate

# Start fresh (ignore checkpoint)
python moltbook_crawler.py --no-resume

# Custom output directory
python moltbook_crawler.py --output my_data

Time filters (posts are included only if created_at falls within the range):

  • End time only (crawl posts created before this time):
    python moltbook_crawler.py --cutoff "2026-01-31T23:59:00+00:00"
  • No end limit (crawl all posts, no cutoff):
    python moltbook_crawler.py --no-cutoff
  • Start time only (crawl posts created after this time):
    python moltbook_crawler.py --start "2026-01-01T00:00:00+00:00"
  • Last N minutes (convenience; start = now − N min):
    python moltbook_crawler.py --last-minutes 60 --output data_test --no-resume
  • Start and end (time range: after start, before cutoff):
    python moltbook_crawler.py --start "2026-01-15T00:00:00+00:00" --cutoff "2026-01-31T23:59:00+00:00"

All times are ISO 8601 with timezone (e.g. +00:00 for UTC). Examples:

# Only posts before 2026-01-31 noon UTC
python moltbook_crawler.py --cutoff "2026-01-31T12:00:00+00:00"

# Only posts on or after 2026-01-15
python moltbook_crawler.py --start "2026-01-15T00:00:00+00:00"

# Time range: Jan 15–31, 2026 (UTC)
python moltbook_crawler.py --start "2026-01-15T00:00:00+00:00" --cutoff "2026-01-31T23:59:59+00:00" --output data_jan

# Quick test: only last 5 minutes
python moltbook_crawler.py --last-minutes 5 --output data_test --no-resume
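
Internally, these flags reduce to a range check on each post's created_at. A minimal sketch of that check, assuming standard ISO 8601 parsing; the crawler's exact boundary handling (strict vs. inclusive) is not reproduced here and may differ.

from datetime import datetime
from typing import Optional

def in_range(created_at: str, start: Optional[datetime], cutoff: Optional[datetime]) -> bool:
    """Return True if an ISO 8601 timestamp falls between start and cutoff (both optional)."""
    ts = datetime.fromisoformat(created_at)
    if start is not None and ts < start:
        return False
    if cutoff is not None and ts > cutoff:
        return False
    return True

# Example: the default cutoff used for this dataset
cutoff = datetime.fromisoformat("2026-01-31T23:59:00+00:00")
print(in_range("2026-01-31T07:17:40.280330+00:00", None, cutoff))  # True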

Output

data/
├── all_posts.jsonl      # One post per line (JSONL)
├── all_comments.jsonl   # One comment per line (flattened)
├── all_agents.jsonl     # Deduplicated agents
├── all_submolts.jsonl   # Submolts with post_count, first_seen_at
├── checkpoint.json      # Resume state
└── stats.json           # Run statistics

2. fetch_submolts.py — Submolts base info only

Purpose: Fetch only submolt metadata from the API (no post/comment crawling). Useful when you need a quick list of communities with names, descriptions, and subscriber counts, without running the full crawler.

What it does

  • Calls GET /api/v1/submolts once (a sketch follows this list)
  • Writes one JSONL file of submolt records and one summary JSON
  • No post or comment fetching; completes in seconds
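
A minimal sketch of that single-call flow, assuming a requests-based client. The output filenames match the table below, but the exact field handling in fetch_submolts.py is not reproduced here.

import json
from datetime import datetime, timezone
from pathlib import Path

import requests

BASE_URL = "https://www.moltbook.com/api/v1"

def fetch_submolts(output_dir="data"):
    """Fetch the submolt list once and write one JSONL file plus a summary JSON."""
    resp = requests.get(f"{BASE_URL}/submolts", timeout=30)
    resp.raise_for_status()
    data = resp.json()

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc).isoformat()

    # One submolt per line; subscribers is renamed from the API's subscriber_count
    with open(out / "submolts_base.jsonl", "w", encoding="utf-8") as f:
        for s in data["submolts"]:
            record = {**s, "subscribers": s.get("subscriber_count"), "crawled_at": now}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Platform-level totals returned alongside the submolt list
    summary = {
        "total_submolts": data["count"],
        "total_posts": data["total_posts"],
        "total_comments": data["total_comments"],
        "fetched_at": now,
    }
    with open(out / "submolts_stats_summary.json", "w", encoding="utf-8") as f:
        json.dump(summary, f, ensure_ascii=False, indent=2)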

Output files

  • submolts_base.jsonl: one submolt per line with id, name, display_name, description, subscribers, crawled_at, plus optional API fields (created_at, last_activity_at, etc.)
  • submolts_stats_summary.json: platform-level counts (total_submolts, total_posts, total_comments, fetched_at)

Fields in submolts_base.jsonl

  • Included: id, name, display_name, description, subscribers (from API subscriber_count), crawled_at, and when present: created_at, last_activity_at, featured_at, created_by
  • Not included: post_count, first_seen_at — those are derived from posts and are only available from moltbook_crawler.py (see all_submolts.jsonl)

Commands

fetch_submolts.py has no time filters: it calls the API once and returns the current list of submolts (a single snapshot).

# Default output directory: data/
python fetch_submolts.py

# Custom output directory
python fetch_submolts.py --output my_data

When to use

  • You only need community list + descriptions + subscriber counts
  • You want to avoid running the full crawler
  • You want a fast snapshot of “what submolts exist” and platform totals

Relation to main crawler

  • fetch_submolts.py: base info only (from /submolts); fast, no post crawl
  • moltbook_crawler.py: base info plus post_count and first_seen_at derived from crawled posts (all_submolts.jsonl)

You can run either script alone or use both (e.g. run fetch_submolts.py for a quick community list, then run moltbook_crawler.py when you need full post/comment data and per-submolt post stats).


Data formats

Post

{
  "id": "f6ea4da4-0c38-4515-8240-c1ebee510f95",
  "title": "帖子标题",
  "content": "帖子内容...",
  "url": null,
  "upvotes": 10,
  "downvotes": 1,
  "score": 9,
  "comment_count": 5,
  "created_at": "2026-01-31T07:17:40.280330+00:00",
  "created_utc": 1738311460,
  "submolt_id": "29beb7ee-ca7d-4290-9c2f-09926264866f",
  "submolt_name": "general",
  "submolt_display_name": "General",
  "author_id": "c5b72a64-9251-440b-8a00-d076d49adcbc",
  "author_name": "AgentName",
  "permalink": "https://www.moltbook.com/post/f6ea4da4-0c38-4515-8240-c1ebee510f95",
  "crawled_at": "2026-01-31T23:59:00+00:00"
}

Comment

{
  "id": "ec748644-1da6-46f4-b7c9-3caecb6cb14f",
  "post_id": "dc524639-6549-4f5c-bcc6-cee56d832539",
  "parent_id": null,
  "content": "评论内容...",
  "upvotes": 1,
  "downvotes": 0,
  "score": 1,
  "depth": 0,
  "is_submitter": false,
  "created_at": "2026-01-31T04:25:31.573770+00:00",
  "created_utc": 1738300731,
  "author_id": "7e33c519-8140-4370-b274-b4a9db16f766",
  "author_name": "eudaemon_0",
  "author_karma": 23455,
  "crawled_at": "2026-01-31T23:59:00+00:00"
}

Submolt

{
  "id": "29beb7ee-ca7d-4290-9c2f-09926264866f",
  "name": "general",
  "display_name": "General",
  "description": "A place for general discussion",
  "subscribers": 1500,
  "post_count": 320,
  "first_seen_at": "2026-01-15T10:00:00+00:00",
  "crawled_at": "2026-01-31T23:59:00+00:00"
}

Agent

{
  "id": "uuid",
  "name": "AgentName",
  "description": "Agent 描述",
  "karma": 1000,
  "follower_count": 50,
  "following_count": 10,
  "owner": {
    "x_handle": "twitter_handle",
    "x_name": "Human Name",
    "x_follower_count": 500,
    "x_verified": false
  },
  "crawled_at": "2026-01-31T23:59:00+00:00"
}

Entity relationships

┌─────────────────┐         ┌─────────────────┐         ┌─────────────────┐
│    Submolts     │ 1    n  │     Posts       │ 1    n  │    Comments     │
├─────────────────┤◄────────┼─────────────────┤◄────────┼─────────────────┤
│ id              │         │ id              │         │ id              │
│ name            │         │ submolt_id ─────┼─────────│ post_id ────────│
│ display_name    │         │ submolt_name    │         │ parent_id       │
│ post_count      │         │ title           │         │ content         │
│ first_seen_at   │         │ content         │         │ depth           │
│ crawled_at      │         │ author_id ──────┼────┐    │ author_id ──────┼────┐
└─────────────────┘         │ ...             │    │    │ ...             │    │
                            └─────────────────┘    │    └─────────────────┘    │
                                                   │                           │
                            ┌─────────────────┐    │                           │
                            │     Agents      │◄───┴───────────────────────────┘
                            ├─────────────────┤
                            │ id              │
                            │ name            │
                            │ karma           │
                            │ follower_count  │
                            │ owner_x_handle  │
                            │ ...             │
                            └─────────────────┘
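
The relationships above can be resolved in memory with lookup dicts keyed by id. A small usage sketch over the crawler's JSONL outputs (file and field names as documented above):

import json

def load_jsonl(path):
    """Read one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

posts = load_jsonl("data/all_posts.jsonl")
agents = load_jsonl("data/all_agents.jsonl")
submolts = load_jsonl("data/all_submolts.jsonl")

agents_by_id = {a["id"]: a for a in agents}
submolts_by_id = {s["id"]: s for s in submolts}

# Example join: attach author karma and submolt description to each post
for p in posts[:5]:
    author = agents_by_id.get(p["author_id"], {})
    submolt = submolts_by_id.get(p["submolt_id"], {})
    print(p["title"][:40], author.get("karma"), submolt.get("description"))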

Checkpoint and resume

  • Progress is saved every 100 posts to checkpoint.json (a sketch follows this list)
  • Restarting the script continues from the last saved offset
  • Use --no-resume to ignore the checkpoint and start from the beginning
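
A hedged sketch of how this checkpointing can work; the field names inside checkpoint.json here are assumptions, so refer to the script for the actual format.

import json
import os

CHECKPOINT_PATH = "data/checkpoint.json"

def save_checkpoint(offset, crawled_post_ids):
    """Persist the current offset and seen post IDs (called every 100 posts)."""
    with open(CHECKPOINT_PATH, "w", encoding="utf-8") as f:
        json.dump({"offset": offset, "post_ids": sorted(crawled_post_ids)}, f)

def load_checkpoint(resume=True):
    """Return (offset, post_ids); start from scratch if resume is disabled or no file exists."""
    if not resume or not os.path.exists(CHECKPOINT_PATH):
        return 0, set()
    with open(CHECKPOINT_PATH, encoding="utf-8") as f:
        state = json.load(f)
    return state["offset"], set(state["post_ids"])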

Analysis example

import json

with open('data/all_posts.jsonl') as f:
    posts = [json.loads(line) for line in f]

with open('data/all_comments.jsonl') as f:
    comments = [json.loads(line) for line in f]

with open('data/all_agents.jsonl') as f:
    agents = [json.loads(line) for line in f]

with open('data/all_submolts.jsonl') as f:
    submolts = [json.loads(line) for line in f]

# Top 10 agents by comment count
from collections import Counter
agent_comments = Counter(c['author_name'] for c in comments)
print("Top 10 agents by comments:", agent_comments.most_common(10))

# Top submolts by post count
submolt_posts = Counter(p['submolt_name'] for p in posts)
print("Top submolts:", submolt_posts.most_common(10))

# Average comment depth
avg_depth = sum(c['depth'] for c in comments) / len(comments)
print(f"Average comment depth: {avg_depth:.2f}")

# Top posts by score
top_posts = sorted(posts, key=lambda p: p['score'], reverse=True)[:10]
for post in top_posts:
    print(f"  [{post['score']:4d}] {post['title'][:50]}")

Notes

  1. Rate limit: Stay under 100 requests/minute.
  2. Use: Research and learning only.
  3. Storage: Expect on the order of 500MB–1GB depending on content size.

Possible improvements

  1. Concurrency: Add limited concurrency with careful rate limiting.
  2. Incremental crawl: Track last crawl time and only fetch new posts.
  3. Compression: Store JSONL as gzip (sketched below).
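
For the compression idea, writing through gzip could be as simple as the following sketch (not implemented in the current scripts; the .gz filenames are an assumption):

import gzip
import json

def write_jsonl_gz(records, path):
    """Write records as gzip-compressed JSONL (e.g. all_posts.jsonl.gz)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def read_jsonl_gz(path):
    """Read gzip-compressed JSONL back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]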
