Skip to content

feat: AutoResearch framework + V2EX test suite (60 tasks, SKILL.md optimization)#717

Merged
jackwener merged 3 commits intomainfrom
feat/v2ex-autoresearch
Apr 3, 2026
Merged

feat: AutoResearch framework + V2EX test suite (60 tasks, SKILL.md optimization)#717
jackwener merged 3 commits intomainfrom
feat/v2ex-autoresearch

Conversation

@jackwener
Copy link
Copy Markdown
Owner

Summary

  • AutoResearch framework — Karpathy-style autonomous iteration loop (engine, config, logger, 4 commands, 3 presets)
  • V2EX test suite — 60 deterministic tasks across 7 layers of difficulty
  • SKILL.md optimization — aggressive chaining rules, minimize-turns guidance

Results

Layer 1: Deterministic Commands

Score: 60/60 (100%)
L1-atomic:      15/15
L2-single-page:  9/9
L3-multi-step:  10/10
L4-write:        5/5
L5-complex:      5/5
L6-interaction: 10/10
L7-chain:        6/6

Layer 2: Claude Code E2E (11 tasks, all pass)

Task Turns Cost Result
Extract hot topics 4 $0.21 PASS
Navigate node 3 $0.25 PASS
Click + read topic 5 $0.23 PASS
Multi-step (tab->topic->replies) 7 $0.32 PASS
Form type + verify 9 $0.32 PASS
Cross-page compare 5 $0.21 PASS
Long chain: compare topics 7 $0.27 PASS
Long chain: pagination 7 $0.28 PASS
Long chain: follow author 9 $0.32 PASS
Long chain: read+summarize 8 $0.35 PASS
Long chain: 3-node visit 7 $0.21 PASS

SKILL.md Optimization Impact

  • Before: 21 turns for complex reply task ($0.68)
  • After: ~7 turns average ($0.28 average)
  • -67% turns, -59% cost

Files (14 new, 1 modified)

AutoResearch Framework

  • autoresearch/engine.ts — 8-phase Karpathy loop
  • autoresearch/config.ts — typed config + CLI parser
  • autoresearch/logger.ts — TSV results log
  • autoresearch/commands/run.ts — main autonomous loop
  • autoresearch/commands/plan.ts — interactive config wizard
  • autoresearch/commands/fix.ts — auto-detect and fix errors
  • autoresearch/commands/debug.ts — hypothesis-driven debugging

V2EX Test Suite

  • autoresearch/v2ex-tasks.json — 60 tasks (7 layers)
  • autoresearch/eval-v2ex.ts — runner with layer-based stats
  • autoresearch/presets/v2ex-reliability.ts — V2EX preset

SKILL.md

  • skills/opencli-operate/SKILL.md — chaining optimization

AutoResearch framework (Karpathy-style autonomous iteration):
- engine.ts: 8-phase loop (review → modify → commit → verify → guard → decide → log)
- config.ts: typed config + CLI parser + metric extraction
- logger.ts: TSV append-only results log
- commands/run.ts: main loop spawning Claude Code per iteration
- commands/plan.ts: interactive config wizard
- commands/fix.ts: auto-detect broken state, iteratively fix
- commands/debug.ts: hypothesis-driven debugging for failing tasks

V2EX test suite (5 layers, 40 tasks):
- L1 Atomic (10): open, state, click, scroll, eval, back, wait
- L2 Single Page (10): hot topics, node list, topic meta, pagination
- L3 Multi-Step (10): click-read, navigate-node, tab-then-topic, pagination
- L4 Write Ops (5): reply typing, favorite detection, form detection
- L5 Complex Chain (5): cross-page collect, multi-node compare, full workflow

Presets: operate-reliability, skill-quality, v2ex-reliability
- Fix v2ex-collect-hot-authors selector (pathname-based member link detection)
- Fix v2ex-wait-text judge (accept "appeared")
- Fix trailing commas in eval step strings
- Add 20 harder tasks: state+click interaction + long chain workflows
- Baseline: 60/60 across all layers
…e turns

- Add Rule #7: minimize total tool calls (3-5 per task, not 15-20)
- Strengthen Rule #5: chain aggressively with &&
- Add explicit good/bad chaining examples
- Add click+wait+state chaining pattern
- Add type+verify chaining pattern

Before: 21 turns for complex V2EX reply task
After: 12 turns for same task (-43% turns, -28% cost)
@jackwener jackwener merged commit 37f1b46 into main Apr 3, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant