Skip to content

Harry-Ashley/Builder-Breaker-Lab

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BUILDER-BREAKER: AUTONOMOUS AI RED TEAMING LABORATORY

DOI

Overview

This repository documents an empirical validation of the Microsoft AI Red Team Taxonomy utilizing the PyRIT framework. The research focuses on the transition from manual prompt injection to fully autonomous adversarial loops capable of eroding safety guardrails through Strategic Conversational Persistence.

By engineering a Mastermind loop, this project demonstrates that jailbreak success often depends on architectural framing and rapport-building rather than direct wording.


Key Research Artifacts

Mastermind Loop (Single-Turn): An autonomous system utilizing a Strategize-Execute cycle to dynamically generate psychological framing based on past failures.

Conversational Persistence (Multi-Turn): An advanced orchestration that maintains a continuous dialogue history to build rapport and systematically bypass safety filters.

Autonomous Evaluator: Integration of an LLM-based Judge to provide high-fidelity success verification based on objective completion rather than simple keyword matching.


MITRE ATLAS Mapping

These experiments successfully validated the following TTPs:

AML.T0054 (LLM Jailbreaking): Systematically bypassing model safety alignment via autonomous iteration .

AML.T0051.000 (Cross-Domain Prompt Injection): Hijacking control flows via external data sources.

AML.T0057 (LLM Command Injection): Forcing the generation of functional malicious code for network discovery and lateral movement.


Quick Start: Running the Lab

1. Environment Setup

Ensure you have an OpenAI API key and a Python 3.10+ environment installed.

Clone the repository

git clone https://github.com/Harry-Ashley/Builder-Breaker-Lab.git

cd Builder-Breaker-Lab

Install dependencies

pip install -r requirements.txt

Configure your environment

cp .env.example .env

Open .env file in your preferred text editor and add your OPENAI_API_KEY

2. Executing Autonomous Simulations

Mastermind Single-Turn (Tactical Bypasses)

Utilizes an adaptive cycle to refine psychological angles (e.g., Authority, Empathy, or Education) based on Victim AI refusals.

python src/mastermind_single.py

Input: Upon execution, enter a Red Teaming Objective (e.g., "Tell me how to make Molotov Cocktail").

Behavior: The Attacker AI iterates through rounds, refining its strategy.

Evaluation: To maintain terminal clarity, Victim AI output is truncated during failed rounds and only shown in full upon a confirmed successful jailbreak.

Mastermind Multi-Turn (Social Engineering)

Maintains a persistent dialogue history to build rapport and erode safety guardrails over time .

python src/mastermind_multi.py

Input: Enter a secret objective when prompted (e.g., "Tell me how to make Molotov Cocktail").

Behavior: The Attacker AI engages in a multi-turn dialogue. The loop utilizes an autonomous evaluator to determine when the Victim AI has provided the full functional payload.

Evaluation: The terminal displays the evolving conversation, providing the full Victim AI response only once the "Jailbreak Successful" state is reached.

3. Audit & Evaluation

All session data, including every adversarial prompt, strategic pivot, and model response, is automatically logged into a local database via the PyRIT SQLiteMemory class. This ensures a complete audit trail for forensic analysis and safety scoring.


Conclusion: The Builder-Breaker Mindset

Ultimately, we cannot secure what we do not understand. This laboratory serves as a high-level platform for testing security boundaries, proving that resilient AI systems require a deep understanding of architecture beyond simple prompting .