
Proposal: Add built-in tool_trajectory_f1 evaluator #5306

@fabrizioamort

Description

🔴 Required Information

Is your feature request related to a specific problem?

ADK's built-in trajectory evaluator (tool_trajectory_in_order) scores each invocation as binary pass/fail. This makes it hard to measure incremental improvements:

  • an agent that calls 3 of 4 expected tools scores 0.0, the same as an agent that calls none of them
  • extra spurious tool calls collapse an otherwise correct trajectory to failure
  • repeated tool names require one-to-one alignment, which greedy matching handles incorrectly

The result is a noisy signal that is hard to use for regression tracking during agent development.

Describe the Solution You'd Like

A new built-in evaluator named tool_trajectory_f1 that scores tool-call trajectories with partial credit using F1 scoring.

  • Metric name: tool_trajectory_f1
  • Criterion type: ToolTrajectoryF1Criterion(BaseCriterion)
  • Match modes: name_only, name_and_args, name_and_required_args
  • Alignment: ordered (monotonic) and unordered (maximum-cardinality)
  • Returns NOT_EVALUATED when reference invocations are unavailable
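To make the two alignment modes concrete, here is a rough sketch for the name_only match mode (the helper names match_count_unordered and match_count_ordered are hypothetical, not ADK API):

```python
from collections import Counter

# Hypothetical helpers sketching the two alignment modes for a
# single invocation, matching tool calls by name only.

def match_count_unordered(actual: list[str], expected: list[str]) -> int:
    """Maximum-cardinality matching: multiset intersection of names."""
    overlap = Counter(actual) & Counter(expected)
    return sum(overlap.values())

def match_count_ordered(actual: list[str], expected: list[str]) -> int:
    """Monotonic matching: longest common subsequence of the two
    tool-name sequences, so matched pairs never cross each other."""
    m, n = len(actual), len(expected)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if actual[i] == expected[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]
```

For example, with actual = ["search", "fetch", "search"] and expected = ["fetch", "search", "search"], unordered matching yields M = 3, while ordered matching yields M = 2, because no order-preserving alignment can pair all three calls. This is also where one-to-one alignment of repeated tool names is handled correctly, unlike greedy matching.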

Per-invocation scoring (A = number of actual tool calls, E = number of expected tool calls, M = number of matched pairs):

  • A == 0 and E == 0 → 1.0
  • A == 0 xor E == 0 → 0.0
  • Otherwise: precision = M/A, recall = M/E, f1 = 2·P·R / (P+R)

Case score is the mean of per-invocation F1 scores.
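The scoring rules above can be sketched directly in Python (a hypothetical illustration of the proposed semantics, not ADK code):

```python
def invocation_f1(matched: int, actual: int, expected: int) -> float:
    """Per-invocation F1 from matched pairs M, actual count A,
    and expected count E, following the rules above."""
    if actual == 0 and expected == 0:
        return 1.0  # nothing expected, nothing called
    if actual == 0 or expected == 0:
        return 0.0  # exactly one side is empty
    precision = matched / actual
    recall = matched / expected
    if precision + recall == 0:
        return 0.0  # no matched pairs at all
    return 2 * precision * recall / (precision + recall)

def case_score(per_invocation_f1s: list[float]) -> float:
    """Case score: mean of per-invocation F1 scores."""
    return sum(per_invocation_f1s) / len(per_invocation_f1s)
```

For instance, an agent that matches 3 of 4 expected calls with 4 actual calls gets invocation_f1(3, 4, 4) = 0.75 instead of a flat 0.0.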

Impact on your work

I am building an evaluation harness for ADK-based agents and need a stable, deterministic regression metric for tool-use quality. The binary scorer makes it difficult to distinguish agents that are nearly correct from agents that are completely wrong, which slows down iterative improvement.

No specific timeline — but this would unblock more reliable eval-driven development for any team using ADK.

Willingness to contribute

Yes. I am willing to implement this and submit a focused PR with unit tests, following the contribution guidelines (CLA signed).


🟡 Recommended Information

Describe Alternatives You've Considered

  • Custom metric function via custom_metrics: works today but requires per-project boilerplate, is not discoverable, and cannot be referenced by name in eval set configs the way built-in metrics can.
  • Patching tool_trajectory_in_order: would break existing eval sets that rely on binary semantics.

Proposed API / Implementation

# eval_metrics.py
class ToolTrajectoryF1Criterion(BaseCriterion):
    match_mode: Literal["name_only", "name_and_args", "name_and_required_args"] = "name_only"
    ordered: bool = True

# trajectory_f1_evaluator.py
class ToolTrajectoryF1Evaluator(Evaluator):
    criterion_type: ClassVar = ToolTrajectoryF1Criterion

    async def evaluate_invocations(
        self,
        actual_invocations,
        expected_invocations,
        conversation_scenario=None,
    ) -> EvaluationResult:
        ...

Registration in _get_default_metric_evaluator_registry():

MetricInfo(
    metric_name="tool_trajectory_f1",
    description="F1 score for tool-call trajectory matching.",
),

Usage in an eval set config:

{
  "metric": "tool_trajectory_f1",
  "criterion": { "threshold": 0.8, "matchMode": "name_only" }
}

Additional Context

Issue #4794 proposes adding ignore_args to the existing trajectory evaluator. tool_trajectory_f1 is complementary — it addresses partial-credit scoring rather than argument filtering, and could itself benefit from an ignore_args-style match mode once #4794 lands.

I'd appreciate early feedback on one API question: is a separate ToolTrajectoryF1Criterion preferred, or should ToolTrajectoryCriterion be extended with an optional scoring: Literal["binary", "f1"] field?
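For concreteness, the extended-criterion alternative could look roughly like this (a hypothetical sketch using a plain dataclass; the actual BaseCriterion base class, field names, and defaults are open questions):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ToolTrajectoryCriterion:
    """Hypothetical sketch: one criterion type for both scorers,
    selected via a `scoring` field, instead of a separate
    ToolTrajectoryF1Criterion class."""
    threshold: float = 1.0
    scoring: Literal["binary", "f1"] = "binary"  # proposed new field
    match_mode: Literal[
        "name_only", "name_and_args", "name_and_required_args"
    ] = "name_only"
    ordered: bool = True  # only meaningful when scoring == "f1"
```

Defaulting scoring to "binary" would keep existing eval sets that use the current trajectory criterion unchanged.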

Metadata

Labels

eval: [Component] This issue is related to evaluation
needs review: [Status] The PR/issue is awaiting review from the maintainer
