🔴 Required Information
Is your feature request related to a specific problem?
ADK's built-in trajectory evaluator (`tool_trajectory_in_order`) scores each invocation as binary pass/fail. This makes it hard to measure incremental improvements:
- An agent that calls 3 of 4 expected tools scores 0.0 — identical to an agent that calls zero correct tools.
- Extra spurious tool calls collapse an otherwise correct trajectory to failure.
- Repeated tool names require one-to-one alignment, which greedy matching handles incorrectly.
The result is a noisy signal that is hard to use for regression tracking during agent development.
Describe the Solution You'd Like
A new built-in evaluator named `tool_trajectory_f1` that scores tool-call trajectories with partial credit using F1 scoring.
- Metric name: `tool_trajectory_f1`
- Criterion type: `ToolTrajectoryF1Criterion(BaseCriterion)`
- Match modes: `name_only`, `name_and_args`, `name_and_required_args`
- Alignment: ordered (monotonic) and unordered (maximum-cardinality)
- Returns `NOT_EVALUATED` when reference invocations are unavailable
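For the `name_only` match mode, both alignment strategies reduce to classic problems: ordered monotonic matching of equal tool names is a longest-common-subsequence computation, and unordered maximum-cardinality matching is a multiset intersection. A minimal sketch (the helper names here are illustrative, not the proposed ADK API):

```python
from collections import Counter


def ordered_matches(actual, expected):
    # Maximum monotonic (order-preserving) matching of equal tool names
    # equals the longest common subsequence of the two name lists.
    m, n = len(actual), len(expected)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if actual[i] == expected[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]


def unordered_matches(actual, expected):
    # Maximum-cardinality matching ignoring order: for name-only
    # equality this is just the multiset intersection size.
    a, e = Counter(actual), Counter(expected)
    return sum(min(a[name], e[name]) for name in a)
```

The two modes diverge exactly when order is wrong but the names are right: `ordered_matches(["a", "b"], ["b", "a"])` is 1, while `unordered_matches` gives 2. Note that the `name_and_args` modes would need a real matching algorithm rather than these shortcuts, since a pair's compatibility then depends on more than the name.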
Per-invocation scoring (A = actual count, E = expected count, M = matched pairs):
- `A == 0 and E == 0` → 1.0
- `A == 0 xor E == 0` → 0.0
- Otherwise: `precision = M/A`, `recall = M/E`, `f1 = 2·P·R / (P+R)`

Case score is the mean of per-invocation F1 scores.
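The scoring rules above can be sketched directly (`invocation_f1` and `case_score` are illustrative names, not part of the proposed API):

```python
def invocation_f1(matched, actual_count, expected_count):
    # Edge cases from the spec: both sides empty is a perfect match,
    # exactly one side empty is a total miss.
    if actual_count == 0 and expected_count == 0:
        return 1.0
    if actual_count == 0 or expected_count == 0:
        return 0.0
    precision = matched / actual_count
    recall = matched / expected_count
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def case_score(per_invocation):
    # Case score: mean of per-invocation F1 scores.
    # per_invocation is a list of (matched, actual, expected) triples.
    return sum(invocation_f1(m, a, e) for m, a, e in per_invocation) / len(per_invocation)
```

This makes the motivating scenario concrete: an agent that calls 3 of the 4 expected tools with no spurious calls gets `invocation_f1(3, 3, 4) ≈ 0.857` instead of the binary evaluator's 0.0.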
Impact on your work
I am building an evaluation harness for ADK-based agents and need a stable, deterministic regression metric for tool-use quality. The binary scorer makes it difficult to distinguish agents that are nearly correct from agents that are completely wrong, which slows down iterative improvement.
No specific timeline — but this would unblock more reliable eval-driven development for any team using ADK.
Willingness to contribute
Yes. I am willing to implement this and submit a focused PR with unit tests, following the contribution guidelines (CLA signed).
🟡 Recommended Information
Describe Alternatives You've Considered
- Custom metric function via `custom_metrics`: works today but requires per-project boilerplate, is not discoverable, and cannot be referenced by name in eval set configs the way built-in metrics can.
- Patching `tool_trajectory_in_order`: would break existing eval sets that rely on binary semantics.
Proposed API / Implementation
```python
# eval_metrics.py
from typing import Literal


class ToolTrajectoryF1Criterion(BaseCriterion):
    match_mode: Literal[
        "name_only", "name_and_args", "name_and_required_args"
    ] = "name_only"
    ordered: bool = True
```

```python
# trajectory_f1_evaluator.py
from typing import ClassVar


class ToolTrajectoryF1Evaluator(Evaluator):
    criterion_type: ClassVar = ToolTrajectoryF1Criterion

    async def evaluate_invocations(
        self,
        actual_invocations,
        expected_invocations,
        conversation_scenario=None,
    ) -> EvaluationResult:
        ...
```
Registration in `_get_default_metric_evaluator_registry()`:

```python
MetricInfo(
    metric_name="tool_trajectory_f1",
    description="F1 score for tool-call trajectory matching.",
),
```
Usage in an eval set config:

```json
{
  "metric": "tool_trajectory_f1",
  "criterion": { "threshold": 0.8, "matchMode": "name_only" }
}
```
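To illustrate the intended threshold semantics — assuming a case passes when the mean per-invocation F1 meets the configured `threshold` (a sketch of the intent, not the evaluator's actual implementation):

```python
import json

# Parse the example config from above.
config = json.loads("""
{
  "metric": "tool_trajectory_f1",
  "criterion": { "threshold": 0.8, "matchMode": "name_only" }
}
""")


def case_passes(per_invocation_f1, threshold):
    # The case score is the mean of per-invocation F1 scores; the case
    # passes when that mean meets the threshold.
    mean = sum(per_invocation_f1) / len(per_invocation_f1)
    return mean >= threshold


# Two invocations: one perfect, one with 3 of 4 tools matched (F1 = 6/7).
print(case_passes([1.0, 6 / 7], config["criterion"]["threshold"]))  # True: mean ≈ 0.929
```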
Additional Context
Issue #4794 proposes adding `ignore_args` to the existing trajectory evaluator. `tool_trajectory_f1` is complementary — it addresses partial-credit scoring rather than argument filtering, and could itself benefit from an `ignore_args`-style match mode once #4794 lands.
I'd appreciate early feedback on one API question: is a separate `ToolTrajectoryF1Criterion` preferred, or should `ToolTrajectoryCriterion` be extended with an optional `scoring: Literal["binary", "f1"]` field?