feat: AutoResearch framework + V2EX test suite (60 tasks, SKILL.md optimization) by jackwener · Pull Request #717 · jackwener/opencli

jackwener · 2026-04-02T20:31:19Z

Summary

AutoResearch framework — Karpathy-style autonomous iteration loop (engine, config, logger, 4 commands, 3 presets)
V2EX test suite — 60 deterministic tasks across 7 layers of difficulty
SKILL.md optimization — aggressive chaining rules, minimize-turns guidance

Results

Layer 1: Deterministic Commands

Score: 60/60 (100%)
L1-atomic:      15/15
L2-single-page:  9/9
L3-multi-step:  10/10
L4-write:        5/5
L5-complex:      5/5
L6-interaction: 10/10
L7-chain:        6/6

Layer 2: Claude Code E2E (11 tasks, all pass)

Task	Turns	Cost	Result
Extract hot topics	4	$0.21	PASS
Navigate node	3	$0.25	PASS
Click + read topic	5	$0.23	PASS
Multi-step (tab->topic->replies)	7	$0.32	PASS
Form type + verify	9	$0.32	PASS
Cross-page compare	5	$0.21	PASS
Long chain: compare topics	7	$0.27	PASS
Long chain: pagination	7	$0.28	PASS
Long chain: follow author	9	$0.32	PASS
Long chain: read+summarize	8	$0.35	PASS
Long chain: 3-node visit	7	$0.21	PASS

SKILL.md Optimization Impact

Before: 21 turns for complex reply task ($0.68)
After: ~7 turns average ($0.28 average)
-67% turns, -59% cost

Files (14 new, 1 modified)

AutoResearch Framework

autoresearch/engine.ts — 8-phase Karpathy loop
autoresearch/config.ts — typed config + CLI parser
autoresearch/logger.ts — TSV results log
autoresearch/commands/run.ts — main autonomous loop
autoresearch/commands/plan.ts — interactive config wizard
autoresearch/commands/fix.ts — auto-detect and fix errors
autoresearch/commands/debug.ts — hypothesis-driven debugging

V2EX Test Suite

autoresearch/v2ex-tasks.json — 60 tasks (7 layers)
autoresearch/eval-v2ex.ts — runner with layer-based stats
autoresearch/presets/v2ex-reliability.ts — V2EX preset

SKILL.md

skills/opencli-operate/SKILL.md — chaining optimization

AutoResearch framework (Karpathy-style autonomous iteration): - engine.ts: 8-phase loop (review → modify → commit → verify → guard → decide → log) - config.ts: typed config + CLI parser + metric extraction - logger.ts: TSV append-only results log - commands/run.ts: main loop spawning Claude Code per iteration - commands/plan.ts: interactive config wizard - commands/fix.ts: auto-detect broken state, iteratively fix - commands/debug.ts: hypothesis-driven debugging for failing tasks V2EX test suite (5 layers, 40 tasks): - L1 Atomic (10): open, state, click, scroll, eval, back, wait - L2 Single Page (10): hot topics, node list, topic meta, pagination - L3 Multi-Step (10): click-read, navigate-node, tab-then-topic, pagination - L4 Write Ops (5): reply typing, favorite detection, form detection - L5 Complex Chain (5): cross-page collect, multi-node compare, full workflow Presets: operate-reliability, skill-quality, v2ex-reliability

- Fix v2ex-collect-hot-authors selector (pathname-based member link detection) - Fix v2ex-wait-text judge (accept "appeared") - Fix trailing commas in eval step strings - Add 20 harder tasks: state+click interaction + long chain workflows - Baseline: 60/60 across all layers

…e turns - Add Rule #7: minimize total tool calls (3-5 per task, not 15-20) - Strengthen Rule #5: chain aggressively with && - Add explicit good/bad chaining examples - Add click+wait+state chaining pattern - Add type+verify chaining pattern Before: 21 turns for complex V2EX reply task After: 12 turns for same task (-43% turns, -28% cost)

jackwener added 3 commits April 3, 2026 03:13

jackwener merged commit 37f1b46 into main Apr 3, 2026
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: AutoResearch framework + V2EX test suite (60 tasks, SKILL.md optimization)#717

feat: AutoResearch framework + V2EX test suite (60 tasks, SKILL.md optimization)#717
jackwener merged 3 commits intomainfrom
feat/v2ex-autoresearch

jackwener commented Apr 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jackwener commented Apr 2, 2026

Summary

Results

Layer 1: Deterministic Commands

Layer 2: Claude Code E2E (11 tasks, all pass)

SKILL.md Optimization Impact

Files (14 new, 1 modified)

AutoResearch Framework

V2EX Test Suite

SKILL.md

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant