
[BUG] Bedrock Guardrail False Positive on Tool Results #1671

@peter-nagy1

Description

Checks

  • I have updated to the latest minor and patch version of Strands
  • I have checked the documentation and this is not expected behavior
  • I have searched ./issues and there are no duplicates of my issue

Strands Version

1.23.0

Python Version

3.12.12

Operating System

macOS 26.2

Installation Method

other

Steps to Reproduce

Summary

Bedrock Guardrails incorrectly flags tool results as prompt injection attacks on subsequent messages. Tool results are stored in the conversation history with role: "user", and when the guardrail evaluates the conversation on any follow-up message, it misidentifies previous tool results as malicious user input.

Key Finding: The issue occurs because tool results are stored with role: "user" in the conversation history. When the guardrail evaluates subsequent requests, it scans all role: "user" messages—including tool results—and flags certain patterns (e.g., "You are Test Admin User.") as prompt injection attacks.
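
For reference, this is roughly how that tool result sits in agent.messages, in the Bedrock Converse message shape (a minimal sketch; the toolUseId value is illustrative):

# The toolResult from identity_tool is stored under role "user",
# even though no human typed this content.
tool_result_message = {
    "role": "user",
    "content": [
        {
            "toolResult": {
                "toolUseId": "tooluse_xyz",  # illustrative ID
                "status": "success",
                "content": [{"text": "You are Test Admin User."}],
            }
        }
    ],
}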

Environment

  • SDK: AWS Strands Agents
  • Model: us.anthropic.claude-haiku-4-5-20251001-v1:0
  • Guardrail: Bedrock Guardrail with prompt attack detection enabled

Code snippet

"""Simple reproduction of guardrail false positive."""

import os

import boto3
import strands
from dotenv import load_dotenv
from strands import Agent
from strands.models.bedrock import BedrockModel

load_dotenv()

@strands.tool
def identity_tool() -> str:
    """Tool that returns identity-like text."""
    return "You are Test Admin User."

boto_session = boto3.Session()
bedrock_model = BedrockModel(
    boto_session=boto_session,
    model_id=os.getenv("BEDROCK_MODEL_ID_HAIKU_4_5"),
    streaming=False,
    guardrail_id=os.getenv("BEDROCK_GUARDRAIL"),
    guardrail_version="DRAFT",
    guardrail_trace="enabled",
    guardrail_redact_input=False
)

agent = Agent(
    model=bedrock_model,
    tools=[identity_tool],
)

print("--- Step 1 ---")

# Step 1: This triggers identity_tool, returns "You are Test Admin User."
agent("Call the identity tool")

print("\n\n--- Step 2 ---")

# Step 2: This fails with prompt attack detection
# Guardrail sees "You are Test Admin User." in history as role="user"
agent("Hi there")  # <-- FAILS HERE

print("\n\nCONVERSATION:")
for i, msg in enumerate(agent.messages):
    print(f"[{i}] role: {msg.get('role')}")
    print(f"    content: {msg.get('content')}")
    print()

What happens

  1. A tool returns innocuous text like "You are Test Admin User."
  2. This gets stored as role: "user" in the conversation history (see the sketch after this list)
  3. User sends any follow-up message (e.g., "Hi there")
  4. The guardrail evaluates the full conversation history
  5. The guardrail sees the previous tool result with role: "user" and flags it as a prompt injection attack
  6. The request is blocked
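
The role assignment in step 2 can be confirmed by printing each message's role next to the kind of each content block (a small sketch against the reproduction script above; the expected output is illustrative):

# Each content block is a single-key dict ("text", "toolUse", "toolResult", ...),
# so the first key tells us what kind of block it is.
for msg in agent.messages:
    block_kinds = [next(iter(block)) for block in msg["content"]]
    print(msg["role"], block_kinds)

# Illustrative output for the run above -- the tool result carries role "user":
#   user      ['text']
#   assistant ['toolUse']
#   user      ['toolResult']   <-- flagged on the next request
#   assistant ['text']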

Expected Behavior

The guardrail should distinguish between actual user input and tool results when evaluating for prompt injection attacks.

Actual Behavior

The guardrail treats all role: "user" messages identically, regardless of whether they are:

  • Actual user input, or
  • Tool results returned by the system

Additional Context

Investigation & Attempted Fixes

  • Enabling guardrail_latest_message does not help: in the Strands SDK's _format_bedrock_messages method, the last message only gets wrapped in guard content when it is text or an image. A toolResult is neither, so nothing gets wrapped and the guardrail still evaluates the whole conversation history.
  • Disabling the guardrail when receiving tool results (via hooks) may pose security risks.
  • Since _format_bedrock_messages preserves existing guardContent, a hook could wrap the toolResult in guardContent when the agent emits event.agent.messages, so the guardrail evaluates only the toolResult. This could work, but it is not an ideal solution (see the first sketch after this list).
  • Disabling the built-in guardrail on the Bedrock model and instead calling ApplyGuardrail manually via boto3 in hooks would give full control over when and what the guardrail scans, but it adds extra code to maintain (see the second sketch after this list).
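
For the guardContent idea, this is roughly the Converse content block the hook would have to produce (a minimal sketch; as I understand the qualifier semantics, "guard_content" marks the text as supporting material rather than a user query, so the prompt attack filter should not treat it as user input):

# Sketch of a guardContent-wrapped content block for the tool result text.
wrapped_tool_result_text = {
    "guardContent": {
        "text": {
            "text": "You are Test Admin User.",
            "qualifiers": ["guard_content"],
        }
    }
}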
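
For the manual approach, the ApplyGuardrail call itself is straightforward (a sketch using the bedrock-runtime API; the hook wiring is omitted):

# Scan only the genuine user input, not the tool results sitting in history.
import os

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier=os.getenv("BEDROCK_GUARDRAIL"),
    guardrailVersion="DRAFT",
    source="INPUT",  # evaluate the text as user input
    content=[{"text": {"text": "Hi there"}}],
)

# "GUARDRAIL_INTERVENED" means the guardrail flagged the content.
if response["action"] == "GUARDRAIL_INTERVENED":
    raise RuntimeError("Input blocked by guardrail")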

Possible Solution

No response

Related Issues

No response

Metadata

Labels: bug (Something isn't working)