Disclaimer:
My words have been formatted by an LLM, but I am a person.
The problem is simple: Auto does a solid job selecting Pro vs. Flash at the start of a user turn. But that selection is locked for the entire agentic loop. If Auto correctly escalates to Pro for initial reasoning and planning, Pro then stays hot for every subsequent inference round — tool calls, debugging, file writes, all of it — burning expensive tokens on execution work Flash could handle blindfolded.
Prior art already exists. Goose's lead/worker pattern solves exactly this:
🦢 Lead model (e.g. Claude Opus, GPT-4o) handles the first N inference rounds — planning, architecture, complex reasoning
⚙️ Worker model (e.g. Claude Haiku, GPT-4o-mini) takes over for execution rounds — file edits, test runs, routine implementation
🛡️ Failure fallback — if the worker starts producing broken code or repeated tool failures, the orchestrator automatically pulls the lead back in for a few recovery rounds, then re-delegates
Critically: same context window, same session history. The switch happens at the seam between inference rounds — after tool results return to the orchestrator but before the next LLM completion is dispatched. No forked contexts, no separate planner-executor sessions. The orchestrator just points the next API call at a different model endpoint.
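For illustration, here's roughly what the configuration surface could look like if something like this were exposed in Gemini CLI — every field name below is hypothetical, not an existing setting:

```typescript
// Hypothetical shape for a lead/worker routing config.
// None of these fields exist in Gemini CLI today; this only shows how
// small the configuration surface for the pattern could be.
interface LeadWorkerConfig {
  leadModel: string;        // model used for the first N inference rounds
  workerModel: string;      // model used for execution rounds afterwards
  leadRounds: number;       // rounds the lead keeps before handing off
  failureThreshold: number; // consecutive worker failures before escalating back
  recoveryRounds: number;   // rounds the lead holds during recovery before re-delegating
}

const exampleConfig: LeadWorkerConfig = {
  leadModel: "gemini-2.5-pro",
  workerModel: "gemini-2.5-flash",
  leadRounds: 2,
  failureThreshold: 2,
  recoveryRounds: 2,
};
```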
This is the seam that already exists in any agentic loop:
```
User Prompt
  → [Auto selects Pro] Inference Round 1 (planning)
  → Tool calls execute → results return to orchestrator
  → 🎯 THIS IS WHERE YOU SWITCH TO FLASH
  → Inference Round 2 (execution)
  → Tool calls execute → results return
  → Inference Round N...
  → Final Response
```
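To make the seam concrete, here's a minimal sketch of an agentic loop that re-selects the model at each round boundary. This is TypeScript against imagined `complete`/`runTools` helpers (and the hypothetical `LeadWorkerConfig` above), not Gemini CLI's actual internals:

```typescript
// Hypothetical message/completion shapes, just enough to make the loop read.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type ToolCall = { name: string; args: unknown };
type Completion = { message: Message; toolCalls: ToolCall[] };

// Minimal sketch of an agentic loop with per-round model routing.
// `complete` and `runTools` are assumed helpers, not real Gemini CLI APIs.
async function agenticTurn(
  userPrompt: string,
  config: LeadWorkerConfig,
  complete: (model: string, history: Message[]) => Promise<Completion>,
  runTools: (calls: ToolCall[]) => Promise<Message[]>,
): Promise<string> {
  const history: Message[] = [{ role: "user", content: userPrompt }];
  let round = 1;

  while (true) {
    // The only change from today's behavior: the model is chosen per round,
    // not once per user turn. Same history, same context window.
    const model = round <= config.leadRounds ? config.leadModel : config.workerModel;

    const completion = await complete(model, history);
    history.push(completion.message);

    if (completion.toolCalls.length === 0) {
      return completion.message.content; // no more tool calls: final response
    }

    // Tool results come back to the orchestrator -- this is the seam.
    const toolResults = await runTools(completion.toolCalls);
    history.push(...toolResults);
    round++;
  }
}
```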
The orchestrator has full control between every inference round. Goose exploits this. Gemini CLI's Auto mode currently doesn't — it makes one model selection and holds it for the entire loop.
What I'm proposing is lightweight. Not a full planner-executor architecture with separate contexts. Not subagents. Just: let Auto re-evaluate (or deterministically downshift) at inference-round boundaries within a single user turn. Pro plans, Flash executes, and if Flash fumbles, Pro steps back in.
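The deterministic-downshift-plus-fallback policy could be as small as this — again a sketch, where `recentFailures` is an assumed counter of consecutive worker-round failures (tool errors, failed patches, etc.) and how that's detected is an open design question:

```typescript
// Sketch of a per-round routing decision with failure-based escalation.
// `recentFailures` counting is left abstract; Gemini CLI exposes nothing like this today.
function pickModel(
  round: number,
  recentFailures: number,
  config: LeadWorkerConfig,
): string {
  // Escalate back to the lead if the worker keeps stumbling.
  if (recentFailures >= config.failureThreshold) {
    return config.leadModel;
  }
  // Deterministic downshift once the planning rounds are done.
  return round <= config.leadRounds ? config.leadModel : config.workerModel;
}
```

A fuller version would also track how long the lead holds during recovery (the `recoveryRounds` knob) before re-delegating to the worker, but the core decision really is this small.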
The open question: I don't know Gemini CLI's internals well enough to say whether the agentic loop exposes this orchestrator seam cleanly, or whether the whole thing is delegated to the API as a monolithic streaming call with no natural injection point. If it's the former, this should be relatively straightforward to implement — Goose's Rust codebase is open source and the pattern is well-documented. If it's the latter, it's a deeper architectural ask.
Would love to hear if others feel this pain, and whether the Gemini team has considered inference-round-level model routing as a feature.