Operating AI Agents: Failure and Recovery

This is the repository for the LinkedIn Learning course Operating AI Agents: Failure and Recovery. The full course is available from LinkedIn Learning.

Course Description

As AI agents shift from experimentation to production, operational failures can create serious business risks. This intermediate course explores practical techniques for monitoring agent behavior, tracing execution paths, and identifying failure modes across single‑ and multi‑agent systems. Through hands-on GitHub Codespaces exercises, you’ll learn how to implement rollback mechanisms, build automated recovery workflows, and create reports that surface agent health and system status in real time. By the end of the course, you’ll have the skills to improve the safety and predictability of AI agents in production, and to respond quickly and effectively when failures occur.

See the readme file in the main branch for updated instructions and information.

You’ll learn how to:

Detect and diagnose AI agent failures in production using monitoring, logging, and execution‑tracing techniques.
Analyze execution logs and system state to identify a failure, attribute the action to a specific agent and operation, and determine its scope and impact by comparing pre‑ and post‑action states.
Implement rollback and other recovery mechanisms that restore a known‑good system state after unintended or destructive agent actions.
Evaluate recovery success by validating restored state, confirming data integrity, and reviewing post‑recovery logs.
Build automated recovery workflows and operational reports that surface agent health, failures, and recovery actions in real time.
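
To make the objectives above concrete, here is a minimal sketch of the snapshot, compare, and rollback pattern they describe. It is an illustration only, not code from the course: the function names (`snapshot`, `diff_states`, `run_with_rollback`), the dictionary-based state, and the log record fields are all hypothetical and chosen for brevity.

```python
import copy

_MISSING = object()  # sentinel so a missing key differs from a key set to None


def snapshot(state):
    """Return a deep copy of the state to use as the known-good baseline."""
    return copy.deepcopy(state)


def diff_states(before, after):
    """Compare pre- and post-action states and report each changed key."""
    changes = {}
    for key in before.keys() | after.keys():
        old = before.get(key, _MISSING)
        new = after.get(key, _MISSING)
        if old != new:
            changes[key] = {
                "before": None if old is _MISSING else copy.deepcopy(old),
                "after": None if new is _MISSING else copy.deepcopy(new),
            }
    return changes


def run_with_rollback(state, action, agent_id, log):
    """Run an agent action; on failure, restore the snapshot and log the scope."""
    before = snapshot(state)
    try:
        action(state)
    except Exception as exc:
        # Record what the failed action touched before undoing it.
        changes = diff_states(before, state)
        state.clear()
        state.update(before)
        log.append({"agent": agent_id, "status": "rolled_back",
                    "error": str(exc), "changes": changes})
        return False
    log.append({"agent": agent_id, "status": "ok",
                "changes": diff_states(before, state)})
    return True


# Hypothetical destructive action that fails partway through.
def destructive_action(state):
    state["orders"].clear()
    raise RuntimeError("tool call failed midway")


state = {"orders": [101, 102], "status": "open"}
log = []
ok = run_with_rollback(state, destructive_action, "agent-1", log)
```

In this sketch the log entry attributes the failure to an agent and records which keys the failed action changed, while the state itself is restored to the snapshot. A production system would likely persist snapshots and logs rather than hold them in memory.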

Notes

This course, Operating AI Agents: Failure and Recovery, is the second course in the governing AI agents series. The first course is Governing AI Agents: Visibility and Control.

Requirements

Python 3.9+
An OpenAI API key

Setup

Clone this repo (or download the files).

Create and activate a virtual environment:

```
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
```

Install dependencies:
```
pip install -r requirements.txt
```

Set your OpenAI API key as an environment variable, or place it in a .env file:

```
export OPENAI_API_KEY="your_api_key"      # macOS/Linux
setx OPENAI_API_KEY "your_api_key"        # Windows PowerShell (applies to new terminal sessions)
```
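
To check that the key is visible to Python before running an exercise, something like the following sketch may help. It uses only the standard library and assumes a simple `KEY=value` layout for the .env file; the helper name `load_api_key` is hypothetical, and the course exercises may load the key differently.

```python
import os
from pathlib import Path


def load_api_key(env_path=".env", name="OPENAI_API_KEY"):
    """Return the key from the environment, falling back to a simple .env file."""
    value = os.environ.get(name)
    if value:
        return value

    path = Path(env_path)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, raw = line.partition("=")
            if key.strip() == name:
                return raw.strip().strip('"').strip("'")

    raise RuntimeError(f"{name} is not set in the environment or in {env_path}")


if __name__ == "__main__":
    try:
        load_api_key()
        print("OPENAI_API_KEY found")  # never print the key itself
    except RuntimeError as exc:
        print(exc)
```

The environment variable takes precedence over the file in this sketch, so a key exported in the shell overrides whatever is stored in .env.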

Instructor

Kesha Williams

Award-Winning Tech Innovator and AI/ML Leader
