feat: AutoResearch framework + V2EX test suite (60 tasks, SKILL.md optimization)#717
Merged
AutoResearch framework (Karpathy-style autonomous iteration):
- engine.ts: 8-phase loop (review → modify → commit → verify → guard → decide → log)
- config.ts: typed config + CLI parser + metric extraction
- logger.ts: TSV append-only results log
- commands/run.ts: main loop spawning Claude Code per iteration
- commands/plan.ts: interactive config wizard
- commands/fix.ts: auto-detect broken state, iteratively fix
- commands/debug.ts: hypothesis-driven debugging for failing tasks

V2EX test suite (5 layers, 40 tasks):
- L1 Atomic (10): open, state, click, scroll, eval, back, wait
- L2 Single Page (10): hot topics, node list, topic meta, pagination
- L3 Multi-Step (10): click-read, navigate-node, tab-then-topic, pagination
- L4 Write Ops (5): reply typing, favorite detection, form detection
- L5 Complex Chain (5): cross-page collect, multi-node compare, full workflow

Presets: operate-reliability, skill-quality, v2ex-reliability
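The decide and log steps of the loop described above might look roughly like the sketch below. It is not taken from engine.ts or logger.ts: the names `decide`, `toTsvRow`, and `runLoop`, the keep-or-revert rule (keep a change only when the guard passes and the metric strictly improves), and the TSV column order are all assumptions made for illustration.

```typescript
// Hypothetical sketch of the "decide" and "log" steps; names and rules are
// assumptions, not the PR's actual engine.ts / logger.ts code.

interface IterationResult {
  iteration: number;
  metric: number; // e.g. number of passing tasks reported by the verify step
  guardPassed: boolean; // outcome of the guard step
  kept: boolean; // whether the change was kept
}

// Decide: keep a change only if the guard passed and the metric strictly improved.
function decide(best: number, metric: number, guardPassed: boolean): boolean {
  return guardPassed && metric > best;
}

// Log: format one tab-separated row, suitable for appending to a results file.
function toTsvRow(r: IterationResult): string {
  return [
    String(r.iteration),
    String(r.metric),
    String(r.guardPassed),
    r.kept ? "keep" : "revert",
  ].join("\t");
}

// Drive the two steps over a list of already-measured candidate changes.
function runLoop(
  baseline: number,
  candidates: Array<{ metric: number; guardPassed: boolean }>,
): { best: number; rows: string[] } {
  let best = baseline;
  const rows: string[] = [];
  candidates.forEach((c, i) => {
    const kept = decide(best, c.metric, c.guardPassed);
    if (kept) {
      best = c.metric;
    }
    rows.push(
      toTsvRow({ iteration: i + 1, metric: c.metric, guardPassed: c.guardPassed, kept }),
    );
  });
  return { best, rows };
}

// Example: baseline of 50 passing tasks, three candidate iterations.
const demo = runLoop(50, [
  { metric: 55, guardPassed: true }, // improves and guard passes: kept
  { metric: 58, guardPassed: false }, // improves but guard fails: reverted
  { metric: 53, guardPassed: true }, // worse than the current best of 55: reverted
]);
```

Keeping the log as plain rows that are only ever appended means a crashed run still leaves a readable history of every iteration up to that point.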
- Fix v2ex-collect-hot-authors selector (pathname-based member link detection)
- Fix v2ex-wait-text judge (accept "appeared")
- Fix trailing commas in eval step strings
- Add 20 harder tasks: state+click interaction + long chain workflows
- Baseline: 60/60 across all layers
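Pathname-based member link detection, as named in the first fix above, could be sketched as follows. This is an illustration only, not the selector from the PR: it assumes V2EX profile links have a pathname of the form `/member/<username>`, and the helper names `memberFromHref` and `collectAuthors` are hypothetical.

```typescript
// Hypothetical sketch; assumes profile links look like /member/<username>.
const V2EX_BASE = "https://www.v2ex.com";

// Return the username if the link's pathname is a member profile, else null.
function memberFromHref(href: string): string | null {
  let pathname: string;
  try {
    // Resolving against a base handles both relative and absolute hrefs,
    // and strips query strings and fragments from the comparison.
    pathname = new URL(href, V2EX_BASE).pathname;
  } catch {
    return null; // unparseable href
  }
  const match = /^\/member\/([^/]+)\/?$/.exec(pathname);
  return match?.[1] ?? null;
}

// Collect unique author names from a list of hrefs, preserving first-seen order.
function collectAuthors(hrefs: string[]): string[] {
  const seen = new Set<string>();
  for (const href of hrefs) {
    const name = memberFromHref(href);
    if (name !== null) {
      seen.add(name);
    }
  }
  return Array.from(seen);
}

const authors = collectAuthors([
  "/member/alice",
  "https://www.v2ex.com/member/bob?tab=1",
  "/t/123#reply1", // topic link, ignored
  "/member/alice", // duplicate, ignored
]);
```

Matching on the parsed pathname rather than on class names or raw href strings tends to be more robust when markup changes or links carry query parameters.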
…e turns
- Add Rule #7: minimize total tool calls (3-5 per task, not 15-20)
- Strengthen Rule #5: chain aggressively with &&
- Add explicit good/bad chaining examples
- Add click+wait+state chaining pattern
- Add type+verify chaining pattern

Before: 21 turns for complex V2EX reply task
After: 12 turns for same task (-43% turns, -28% cost)
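The chaining idea and the reported turn reduction can be illustrated with a small sketch. The `chain` and `percentReduction` helpers are hypothetical, and the `tool click`, `tool wait`, and `tool state` strings are placeholders rather than real CLI syntax from SKILL.md.

```typescript
// Hypothetical helpers; the command strings below are placeholders only.

// Join several shell steps into one invocation so they cost a single tool call.
// With &&, a later step runs only if every earlier step succeeded.
function chain(steps: string[]): string {
  return steps
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .join(" && ");
}

// Percentage reduction from `before` to `after`, rounded to a whole number.
function percentReduction(before: number, after: number): number {
  return Math.round(((before - after) / before) * 100);
}

// Three separate tool calls (click, then wait, then read state)...
const separate = ["tool click 12", "tool wait 1", "tool state"];
// ...collapsed into a single chained call.
const chained = chain(separate);

// Turn counts reported above: 21 before, 12 after.
const turnsSaved = percentReduction(21, 12);
```

Here `chained` is one command string instead of three, and `percentReduction(21, 12)` reproduces the 43% figure quoted for turns.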
Summary
Results
Layer 1: Deterministic Commands
Layer 2: Claude Code E2E (11 tasks, all pass)
SKILL.md Optimization Impact
Files (14 new, 1 modified)
AutoResearch Framework
- autoresearch/engine.ts: 8-phase Karpathy loop
- autoresearch/config.ts: typed config + CLI parser
- autoresearch/logger.ts: TSV results log
- autoresearch/commands/run.ts: main autonomous loop
- autoresearch/commands/plan.ts: interactive config wizard
- autoresearch/commands/fix.ts: auto-detect and fix errors
- autoresearch/commands/debug.ts: hypothesis-driven debugging
V2EX Test Suite
- autoresearch/v2ex-tasks.json: 60 tasks (7 layers)
- autoresearch/eval-v2ex.ts: runner with layer-based stats
- autoresearch/presets/v2ex-reliability.ts: V2EX preset
SKILL.md
- skills/opencli-operate/SKILL.md: chaining optimization