
test(evals): add behavioral tests for tool output masking #19172

Open
NTaylorMullen wants to merge 2 commits into main from ntm/eval-tool-output-masking

Conversation

@NTaylorMullen (Collaborator)

Summary

This PR adds behavioral evaluations to test the agent's ability to handle masked tool outputs. These tests ensure the agent correctly identifies when a large output has been redirected to a file and decides whether to read that file based on the information available in the preview.

Details

  • Added evals/tool_output_masking.eval.ts using the standard evalTest framework (a rough sketch of the test shape follows this list).
  • Included two main scenarios:
    1. Battle Test: Verifies the agent extracts a file path from a <tool_output_masked> tag and uses a tool (like read_file) to retrieve the full content.
    2. Efficiency Test: Verifies the agent avoids unnecessary file reads when the required information is already present in the preview snippet.
  • Uses the USUALLY_PASSES policy to align with existing behavioral evaluations.
  • Verified with a 100% pass rate over 10 runs using gemini-3-flash-preview.
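
The sketch below approximates what one of these scenarios might look like. It is illustrative only: evalTest, USUALLY_PASSES, the rig object, and the <tool_output_masked> format are named in this PR's description, but their real signatures live in the evals/ directory and are assumed here.

```typescript
// Hypothetical sketch only: evalTest, USUALLY_PASSES, and the rig object are
// named in this PR, but their real signatures are assumed; the
// <tool_output_masked> format is likewise an approximation.
import { expect } from 'vitest';
import { evalTest, USUALLY_PASSES } from './test-helper.js';

evalTest(
  'reads the redirected output file when the answer is not in the preview',
  { policy: USUALLY_PASSES },
  async (rig: { run: (prompt: string) => Promise<{ toolCalls: { name: string }[] }> }) => {
    // A large tool output was redirected to a file; the agent only sees a
    // short preview plus the file path inside the masked tag.
    const masked = [
      '<tool_output_masked>',
      'Full output written to: /tmp/large-tool-output.txt',
      'Preview: "5000 lines total; line 1: ..."',
      '</tool_output_masked>',
    ].join('\n');

    const result = await rig.run(
      `The previous command returned:\n${masked}\nWhat does the last line say?`,
    );

    // Battle Test expectation: the agent extracts the path from the tag and
    // calls a file-reading tool instead of answering from the preview alone.
    expect(result.toolCalls.map((c) => c.name)).toContain('read_file');
  },
);
```

The Efficiency Test is the mirror image: the prompt is answerable from the preview alone, and the assertion checks that no file-reading tool was called.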

Related Issues

N/A

How to Validate

Run the new evaluations using:

RUN_EVALS=true npx vitest -c evals/vitest.config.ts evals/tool_output_masking.eval.ts --run

Pre-Merge Checklist

  • Updated relevant documentation and README (if needed)
  • Added/updated tests (if needed)
  • Noted breaking changes (if any)
  • Validated on required platforms/methods:
    • MacOS
      • npm run
      • npx
      • Docker
      • Podman
      • Seatbelt
    • Windows
      • npm run
      • npx
      • Docker
    • Linux
      • npm run
      • npx
      • Docker

- Verifies agent can read redirected large tool outputs from files
- Verifies agent avoids redundant reads when information is in the preview
- Uses standard evalTest framework with USUALLY_PASSES policy
@NTaylorMullen NTaylorMullen requested a review from a team as a code owner February 15, 2026 20:38
@gemini-code-assist (Contributor)

Summary of Changes

Hello @NTaylorMullen, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new suite of behavioral evaluations designed to enhance the agent's intelligence in processing large tool outputs that are initially masked. These tests are crucial for ensuring the agent can intelligently determine when to access full output files versus when to rely on preview information, thereby optimizing its decision-making and operational efficiency.

Highlights

  • New Behavioral Evaluations: Added evals/tool_output_masking.eval.ts to introduce behavioral evaluations for testing the agent's ability to handle masked tool outputs effectively.
  • Battle Test Scenario: Implemented a test case that verifies the agent correctly identifies a masked tool output and uses a tool (e.g., read_file) to retrieve the full content from a redirected file.
  • Efficiency Test Scenario: Included a test case to ensure the agent avoids unnecessary file reads when the required information is already available within the preview snippet of a masked output.
Changelog
  • evals/tool_output_masking.eval.ts
    • Added a new test file for tool output masking behavioral evaluations.
    • Implemented a findDir helper function to locate project directories (a hypothetical sketch of such a helper follows this list).
    • Included a 'Battle Test' scenario to verify the agent reads masked full output files.
    • Included an 'Efficiency Test' scenario to verify the agent avoids unnecessary reads when information is in the preview.
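
The findDir helper itself is not shown in this conversation; the snippet below is only a guess at what a helper that "locates project directories" might do. The name findDir comes from the changelog entry above; everything else is assumed.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical reconstruction: walk upward from a starting directory until a
// child directory with the given name is found. The real findDir in the eval
// file may work differently (e.g. searching downward instead).
function findDir(startDir: string, targetName: string): string | undefined {
  let current = path.resolve(startDir);
  for (;;) {
    const candidate = path.join(current, targetName);
    if (fs.existsSync(candidate) && fs.statSync(candidate).isDirectory()) {
      return candidate;
    }
    const parent = path.dirname(current);
    if (parent === current) {
      return undefined; // reached the filesystem root without a match
    }
    current = parent;
  }
}

// Example: locate the evals/ directory from wherever the test is run.
// const evalsDir = findDir(process.cwd(), 'evals');
```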
Activity
  • New behavioral evaluations were added to test the agent's handling of masked tool outputs.
  • The new evaluations passed with a 100% success rate over 10 runs using gemini-3-flash-preview.
  • The pre-merge checklist indicates that tests were added/updated and validated on MacOS.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR adds two useful behavioral tests for tool output masking. The tests are well-structured, covering both the case where the agent needs to read a masked file and the case where it should not.

I've found a couple of issues in the test implementation that should be addressed:

  • A critical issue where the folder trust mechanism is not correctly configured or enabled, causing the test to pass for the wrong reasons.
  • A high-severity issue where an assertion relies on incorrect usage of JSON.stringify, making it fragile (illustrated with a hypothetical sketch after this comment).

Fixing these will make the new tests more robust and ensure they are correctly validating the agent's behavior under security constraints.
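
Neither assertion is shown in this conversation, but the JSON.stringify fragility described above generally looks like the hypothetical before/after below: serializing a string adds surrounding quotes (and escapes characters such as backslashes), so a substring match against a plain-text tool log only succeeds if the log happens to embed the value in exactly that serialized form.

```typescript
import { expect } from 'vitest';

// Hypothetical values; not the actual test code from this PR.
const maskedFilePath = '/tmp/large-tool-output.txt';
const toolLog = `read_file ${maskedFilePath}`;

// Fragile: JSON.stringify(maskedFilePath) is '"/tmp/large-tool-output.txt"'
// (quotes included), so this substring check only passes if the log happens
// to contain the path in that exact serialized form.
// expect(toolLog).toContain(JSON.stringify(maskedFilePath));

// Robust: assert on the raw value instead, which is what the follow-up
// commit in this PR describes doing.
expect(toolLog).toContain(maskedFilePath);
```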

github-actions bot commented Feb 15, 2026

Size Change: -2 B (0%)

Total Size: 24.4 MB

Filename Size Change
./bundle/gemini.js 24.4 MB -2 B (0%)
./bundle/sandbox-macos-permissive-open.sb 890 B 0 B
./bundle/sandbox-macos-permissive-proxied.sb 1.31 kB 0 B
./bundle/sandbox-macos-restrictive-open.sb 3.36 kB 0 B
./bundle/sandbox-macos-restrictive-proxied.sb 3.56 kB 0 B
./bundle/sandbox-macos-strict-open.sb 4.82 kB 0 B
./bundle/sandbox-macos-strict-proxied.sb 5.02 kB 0 B

compressed-size-action

@gemini-cli gemini-cli bot added the status/need-issue label (Pull requests that need to have an associated issue) on Feb 15, 2026.
- Enable folder trust security in test params
- Update trusted folder path to use rig.homeDir
- Fix fragile tool log assertion by removing unnecessary JSON.stringify