
Proposal: Add built-in tool_trajectory_f1 evaluator #5306

@fabrizioamort

Description

🔴 Required Information

Is your feature request related to a specific problem?

ADK's built-in trajectory evaluator (tool_trajectory_in_order) scores each invocation as binary pass/fail. This makes it hard to measure incremental improvements:

  • an agent that calls 3 of 4 expected tools scores 0.0, the same as an agent that calls none of them
  • extra spurious tool calls collapse an otherwise correct trajectory to failure
  • repeated tool names require one-to-one alignment, which greedy matching handles incorrectly

The result is a noisy signal that is hard to use for regression tracking during agent development.

Describe the Solution You'd Like

A new built-in evaluator named tool_trajectory_f1 that scores tool-call trajectories with partial credit using F1 scoring.

  • Metric name: tool_trajectory_f1
  • Criterion type: ToolTrajectoryF1Criterion(BaseCriterion)
  • Match modes: name_only, name_and_args, name_and_required_args
  • Alignment: ordered (monotonic) and unordered (maximum-cardinality)
  • Returns NOT_EVALUATED when reference invocations are unavailable
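To make the two alignment modes concrete, here is a rough sketch for the name_only match mode (the helper names match_count_unordered and match_count_ordered are hypothetical, not ADK API):

```python
from collections import Counter

# Hypothetical helpers sketching the two alignment modes for a
# single invocation, matching tool calls by name only.

def match_count_unordered(actual: list[str], expected: list[str]) -> int:
    """Maximum-cardinality matching: multiset intersection of names."""
    overlap = Counter(actual) & Counter(expected)
    return sum(overlap.values())

def match_count_ordered(actual: list[str], expected: list[str]) -> int:
    """Monotonic matching: longest common subsequence of the two
    tool-name sequences, so matched pairs never cross each other."""
    m, n = len(actual), len(expected)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if actual[i] == expected[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]
```

For example, with actual = ["search", "fetch", "search"] and expected = ["fetch", "search", "search"], unordered matching yields M = 3, while ordered matching yields M = 2, because no order-preserving alignment can pair all three calls. This is also where one-to-one alignment of repeated tool names is handled correctly, unlike greedy matching.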

Per-invocation scoring (A = number of actual tool calls, E = number of expected tool calls, M = number of matched pairs):

  • A == 0 and E == 0 → 1.0
  • A == 0 xor E == 0 → 0.0
  • Otherwise: precision = M/A, recall = M/E, f1 = 2·P·R / (P+R)

Case score is the mean of per-invocation F1 scores.
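The scoring rules above can be sketched directly in Python (a hypothetical illustration of the proposed semantics, not ADK code):

```python
def invocation_f1(matched: int, actual: int, expected: int) -> float:
    """Per-invocation F1 from matched pairs M, actual count A,
    and expected count E, following the rules above."""
    if actual == 0 and expected == 0:
        return 1.0  # nothing expected, nothing called
    if actual == 0 or expected == 0:
        return 0.0  # exactly one side is empty
    precision = matched / actual
    recall = matched / expected
    if precision + recall == 0:
        return 0.0  # no matched pairs at all
    return 2 * precision * recall / (precision + recall)

def case_score(per_invocation_f1s: list[float]) -> float:
    """Case score: mean of per-invocation F1 scores."""
    return sum(per_invocation_f1s) / len(per_invocation_f1s)
```

For instance, an agent that matches 3 of 4 expected calls with 4 actual calls gets invocation_f1(3, 4, 4) = 0.75 instead of a flat 0.0.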

Impact on your work

I am building an evaluation harness for ADK-based agents and need a stable, deterministic regression metric for tool-use quality. The binary scorer makes it difficult to distinguish agents that are nearly correct from agents that are completely wrong, which slows down iterative improvement.

No specific timeline — but this would unblock more reliable eval-driven development for any team using ADK.

Willingness to contribute

Yes. I am willing to implement this and submit a focused PR with unit tests, following the contribution guidelines (CLA signed).


🟡 Recommended Information

Describe Alternatives You've Considered

  • Custom metric function via custom_metrics: works today but requires per-project boilerplate, is not discoverable, and cannot be referenced by name in eval set configs the way built-in metrics can.
  • Patching tool_trajectory_in_order: would break existing eval sets that rely on binary semantics.

Proposed API / Implementation

# eval_metrics.py
class ToolTrajectoryF1Criterion(BaseCriterion):
    match_mode: Literal["name_only", "name_and_args", "name_and_required_args"] = "name_only"
    ordered: bool = True

# trajectory_f1_evaluator.py
class ToolTrajectoryF1Evaluator(Evaluator):
    criterion_type: ClassVar = ToolTrajectoryF1Criterion

    async def evaluate_invocations(
        self,
        actual_invocations,
        expected_invocations,
        conversation_scenario=None,
    ) -> EvaluationResult:
        ...

Registration in _get_default_metric_evaluator_registry():

MetricInfo(
    metric_name="tool_trajectory_f1",
    description="F1 score for tool-call trajectory matching.",
),

Usage in an eval set config:

{
  "metric": "tool_trajectory_f1",
  "criterion": { "threshold": 0.8, "matchMode": "name_only" }
}

Additional Context

Issue #4794 proposes adding ignore_args to the existing trajectory evaluator. tool_trajectory_f1 is complementary — it addresses partial-credit scoring rather than argument filtering, and could itself benefit from an ignore_args-style match mode once #4794 lands.

I'd appreciate early feedback on one API question: is a separate ToolTrajectoryF1Criterion preferred, or should ToolTrajectoryCriterion be extended with an optional scoring: Literal["binary", "f1"] field?
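For concreteness, the extended-criterion alternative could look roughly like this (a hypothetical sketch using a plain dataclass; the actual BaseCriterion base class, field names, and defaults are open questions):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ToolTrajectoryCriterion:
    """Hypothetical sketch: one criterion type for both scorers,
    selected via a `scoring` field, instead of a separate
    ToolTrajectoryF1Criterion class."""
    threshold: float = 1.0
    scoring: Literal["binary", "f1"] = "binary"  # proposed new field
    match_mode: Literal[
        "name_only", "name_and_args", "name_and_required_args"
    ] = "name_only"
    ordered: bool = True  # only meaningful when scoring == "f1"
```

Defaulting scoring to "binary" would keep existing eval sets that use the current trajectory criterion unchanged.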

Metadata

Labels

eval: [Component] This issue is related to evaluation
needs review: [Status] The PR/issue is awaiting review from the maintainer
