test(evals): add behavioral tests for tool output masking #19172
NTaylorMullen wants to merge 2 commits into main
- Verifies agent can read redirected large tool outputs from files
- Verifies agent avoids redundant reads when information is in the preview
- Uses standard evalTest framework with USUALLY_PASSES policy
Summary of Changes
Hello @NTaylorMullen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new suite of behavioral evaluations for how the agent processes large tool outputs that are initially masked. The tests check that the agent can decide when to open the full output file and when to rely on the preview alone, so that it avoids both missing information and redundant reads.
Code Review
This PR adds two useful behavioral tests for tool output masking. The tests are well-structured, covering both the case where the agent needs to read a masked file and the case where it should not.
I've found a couple of issues in the test implementation that should be addressed:
- A critical issue where the folder trust mechanism is not correctly configured or enabled, causing the test to pass for the wrong reasons.
- A high-severity issue where an assertion relies on incorrect usage of `JSON.stringify`, making it fragile.
Fixing these will make the new tests more robust and ensure they are correctly validating the agent's behavior under security constraints.
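To illustrate the second point, here is a minimal sketch of one way a `JSON.stringify`-based assertion on tool logs can be fragile. It is not the code from this PR: the `ToolLogEntry` shape, the `file_path` argument name, and both helper functions are assumptions made for the example.

```typescript
// Hypothetical shape of one recorded tool call. The real log type may differ.
interface ToolLogEntry {
  toolRequest: {
    name: string;
    args: string; // assumed to already be a JSON-encoded string
  };
}

// Fragile: serializing the whole log escapes the quotes inside `args`,
// so a substring match written against unescaped JSON silently fails.
function fragileWasFileRead(logs: ToolLogEntry[], path: string): boolean {
  return JSON.stringify(logs).includes(`"file_path":"${path}"`);
}

// Sturdier: inspect the structured fields and parse `args` exactly once.
function wasFileRead(logs: ToolLogEntry[], path: string): boolean {
  return logs.some((entry) => {
    if (entry.toolRequest.name !== 'read_file') return false;
    const args = JSON.parse(entry.toolRequest.args) as { file_path?: string };
    return args.file_path === path;
  });
}

const sampleLogs: ToolLogEntry[] = [
  {
    toolRequest: {
      name: 'read_file',
      args: JSON.stringify({ file_path: '/tmp/out.txt' }),
    },
  },
];

console.log(fragileWasFileRead(sampleLogs, '/tmp/out.txt'));
console.log(wasFileRead(sampleLogs, '/tmp/out.txt'));
```

In this sketch the substring check misses a call that did happen, because the double-encoded arguments contain `\"` rather than `"`. Comparing parsed fields avoids depending on escaping or key order.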
Size Change: -2 B (0%)
Total Size: 24.4 MB
- Enable folder trust security in test params
- Update trusted folder path to use rig.homeDir
- Fix fragile tool log assertion by removing unnecessary JSON.stringify
Summary
This PR adds behavioral evaluations to test the agent's ability to handle masked tool outputs. These tests ensure the agent can correctly identify when a large output has been redirected to a file and chooses to read it (or not) based on the information available in the preview.
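The behavior under test can be sketched roughly as follows. This is an illustration only: apart from the `<tool_output_masked>` tag name, everything here (the `MaskedOutput` type, the preview length, the layout inside the tag, and both function names) is a hypothetical stand-in, not the project's actual masking implementation.

```typescript
// Hypothetical result of passing a tool output through a masking step.
interface MaskedOutput {
  text: string; // what the agent sees in its context
  masked: boolean; // true when the full output was redirected to a file
}

// Assumed masking rule: outputs longer than `previewChars` are cut down to a
// preview, with a pointer to the file that holds the full content.
function maskToolOutput(
  output: string,
  savedTo: string,
  previewChars = 200,
): MaskedOutput {
  if (output.length <= previewChars) {
    return { text: output, masked: false };
  }
  const preview = output.slice(0, previewChars);
  const text = [
    '<tool_output_masked>',
    `Preview: ${preview}`,
    `Full output saved to: ${savedTo}`,
    '</tool_output_masked>',
  ].join('\n');
  return { text, masked: true };
}

// The decision the two evaluations probe: read the file only when the
// output was masked and the preview does not already contain the answer.
function shouldReadFullOutput(result: MaskedOutput, needle: string): boolean {
  return result.masked && !result.text.includes(needle);
}

const buried = maskToolOutput('A'.repeat(500) + 'SECRET=42', '/tmp/out.txt');
const upFront = maskToolOutput('SECRET=42' + 'A'.repeat(500), '/tmp/out.txt');

console.log(shouldReadFullOutput(buried, 'SECRET=42'));
console.log(shouldReadFullOutput(upFront, 'SECRET=42'));
```

The first case corresponds to the evaluation where the agent must follow up with a file read, the second to the evaluation where a read would be redundant.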
Details
- Adds `evals/tool_output_masking.eval.ts` using the standard `evalTest` framework.
- Verifies that the agent recognizes the `<tool_output_masked>` tag and uses a tool (like `read_file`) to retrieve the full content.
- Uses the `USUALLY_PASSES` policy to align with existing behavioral evaluations.
- … `gemini-3-flash-preview`.
Related Issues
N/A
How to Validate
Run the new evaluations using:
Pre-Merge Checklist