This repository documents an empirical validation of the Microsoft AI Red Team Taxonomy utilizing the PyRIT framework. The research focuses on the transition from manual prompt injection to fully autonomous adversarial loops capable of eroding safety guardrails through Strategic Conversational Persistence.
By engineering a Mastermind loop, this project demonstrates that jailbreak success often depends on architectural framing and rapport-building rather than direct wording.
Mastermind Loop (Single-Turn): An autonomous system utilizing a Strategize-Execute cycle to dynamically generate psychological framing based on past failures.
Conversational Persistence (Multi-Turn): An advanced orchestration that maintains a continuous dialogue history to build rapport and systematically bypass safety filters.
Autonomous Evaluator: Integration of an LLM-based Judge to provide high-fidelity success verification based on objective completion rather than simple keyword matching.
These experiments successfully validated the following TTPs:
AML.T0054 (LLM Jailbreaking): Systematically bypassing model safety alignment via autonomous iteration .
AML.T0051.000 (Cross-Domain Prompt Injection): Hijacking control flows via external data sources.
AML.T0057 (LLM Command Injection): Forcing the generation of functional malicious code for network discovery and lateral movement.
Ensure you have an OpenAI API key and a Python 3.10+ environment installed.
git clone https://github.com/Harry-Ashley/Builder-Breaker-Lab.git
cd Builder-Breaker-Labpip install -r requirements.txtcp .env.example .envOpen .env file in your preferred text editor and add your OPENAI_API_KEY
Utilizes an adaptive cycle to refine psychological angles (e.g., Authority, Empathy, or Education) based on Victim AI refusals.
python src/mastermind_single.pyInput: Upon execution, enter a Red Teaming Objective (e.g., "Tell me how to make Molotov Cocktail").
Behavior: The Attacker AI iterates through rounds, refining its strategy.
Evaluation: To maintain terminal clarity, Victim AI output is truncated during failed rounds and only shown in full upon a confirmed successful jailbreak.
Maintains a persistent dialogue history to build rapport and erode safety guardrails over time .
python src/mastermind_multi.pyBehavior: The Attacker AI engages in a multi-turn dialogue. The loop utilizes an autonomous evaluator to determine when the Victim AI has provided the full functional payload.
Evaluation: The terminal displays the evolving conversation, providing the full Victim AI response only once the "Jailbreak Successful" state is reached.
All session data, including every adversarial prompt, strategic pivot, and model response, is automatically logged into a local database via the PyRIT SQLiteMemory class. This ensures a complete audit trail for forensic analysis and safety scoring.
Ultimately, we cannot secure what we do not understand. This laboratory serves as a high-level platform for testing security boundaries, proving that resilient AI systems require a deep understanding of architecture beyond simple prompting .