Course: EE 726 - Advanced Machine Learning (IIT Bombay)
Instructor: Prof. Amit Sethi
Authors: V Sathvik, Aman Moon
Status: Active
The AI Guard Agent is an interactive security monitor developed as part of the Advanced Machine Learning curriculum. It integrates Automatic Speech Recognition (ASR), Facial Recognition, and Large Language Models (LLMs) to create a system that secures environments by actively monitoring audio-visual inputs and verifying user identities.
It operates via a robust multi-threaded Python backend, ensuring real-time performance by handling blocking I/O operations (such as video feeds and audio streams) concurrently. The system utilizes OpenAI's Whisper for transcription, Google's Gemma for conversational intelligence, and dlib-based biometric models for security.
The following libraries were utilized to build the various modules of the system:
| Library / Tool | Purpose & Implementation |
|---|---|
| Eel | Bridges the Python backend with the React/TypeScript frontend for asynchronous communication. |
| OpenAI Whisper | ASR (Automatic Speech Recognition): Converts raw audio input into text with high accuracy. |
| Google Gemma | LLM (Large Language Model): Provides the conversational intelligence (running on torch via Hugging Face). |
| dlib | Face Detection & Recognition: Uses HOG models to generate 128-d face encodings for verification. |
| PyTorch (CUDA) | Inference Engine: Accelerates the Gemma LLM and Whisper models using GPU computing. |
| pyttsx3 | TTS (Text-to-Speech): Synthesizes the AI's text responses into spoken audio offline. |
| NumPy | Signal Processing: Performs Fast Fourier Transforms (FFT) to generate real-time audio visualizations. |
| FuzzyWuzzy / Jellyfish | Command Matching: Implements Levenshtein distance and phonetic matching for command recognition. |
| OpenCV | Video Capture: Manages the webcam feed and frame extraction for the vision thread. |
| React + TypeScript | Frontend UI: Provides a dynamic, type-safe user interface for the security dashboard. |
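To illustrate the Eel bridge concretely, here is a minimal sketch of exposing a Python function to the frontend and serving the compiled React build; the directory, function, and payload names are hypothetical, not the project's actual ones.

```python
import eel

# Serve the compiled React/TypeScript assets (directory name is an assumption).
eel.init("build")

@eel.expose  # Callable from JavaScript as eel.get_status()
def get_status():
    # Hypothetical status payload for the security dashboard.
    return {"is_verified": False, "mode": "idle"}

# Blocks and serves the UI; the entry file name is illustrative.
eel.start("index.html", size=(1280, 800))
```

On the TypeScript side, Eel's generated bridge makes the same function awaitable, e.g. `await eel.get_status()()`.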
The core strength of the AI Guard Agent lies in its concurrent execution architecture. To ensure low latency and prevent the main application loop from freezing during heavy inference or I/O tasks, the system implements a multi-threaded design with three primary worker threads, sketched in code after the descriptions below:
**Audio Thread**
- Role: Handles continuous audio ingestion and processing.
- Functionality:
- Captures raw PCM data at 16 kHz.
- Performs Voice Activity Detection (VAD) to filter silence and ambient noise.
- Computes Fast Fourier Transform (FFT) for real-time frequency visualization on the UI.
- Pushes valid audio segments to the recognition queue.
**Vision Thread**
- Role: Manages the video feed and biometric verification.
- Functionality:
- Operates on a Producer-Consumer model to decouple frame capture from processing.
- Uses dlib's HOG face detector to locate faces in real-time.
- Generates 128-dimensional face encodings and compares them against a trusted database using Euclidean distance.
- Updates the global `is_verified` state flag based on temporal smoothing.
**LLM Thread**
- Role: The "Brain" of the agent.
- Functionality:
- Manages the Large Language Model (Google Gemma) context window.
- Receives transcribed text from the Audio thread.
- Generates context-aware responses (switching between "Assistant" and "Guard" personas).
- Synthesizes speech output using `pyttsx3` (TTS).
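A minimal sketch of this three-thread layout, using Python's standard `threading` and `queue` modules; the worker bodies are placeholder stand-ins for the real capture, verification, and generation logic:

```python
import threading
import queue
import time

audio_queue = queue.Queue()     # raw speech segments -> ASR stage
text_queue = queue.Queue()      # transcripts -> LLM worker
state = {"is_verified": False}  # shared flag written by the vision thread

def audio_worker():
    # Placeholder for VAD-gated PCM capture from the microphone.
    while True:
        time.sleep(1.0)
        audio_queue.put(b"\x00" * 32000)  # stand-in for 1 s of 16 kHz PCM

def vision_worker():
    # Placeholder for the dlib verification loop reading the webcam.
    while True:
        time.sleep(0.5)
        state["is_verified"] = True  # stand-in decision

def llm_worker():
    # An ASR stage would consume audio_queue and feed text_queue; omitted here.
    while True:
        text = text_queue.get()  # blocks until a transcript arrives
        persona = "assistant" if state["is_verified"] else "guard"
        print(f"[{persona}] responding to: {text}")

for worker in (audio_worker, vision_worker, llm_worker):
    # Daemon threads exit with the main process, keeping shutdown simple.
    threading.Thread(target=worker, daemon=True).start()
# In the real application, Eel's event loop keeps the main thread alive.
```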
The audio logic prioritizes responsiveness and visualization.
- Signal Processing: Raw audio is decomposed into frequency bins (50 Hz to 16 kHz). A logarithmic transformation is applied to the magnitudes to approximate human loudness perception for the UI visualizer:
  $$\text{log\_magnitudes} = \ln(1 + \text{band\_magnitudes})$$
- Voice Activity Detection (VAD): A dynamic silence threshold is established via calibration:
  $$T_{\text{silence}} = \max(100,\; A_{\text{ambient\_max}} \times 1.2)$$
- Command Recognition: Uses fuzzy string matching to detect activation keywords with >85% confidence (see the sketch after this list).
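A compact sketch of these three steps, assuming NumPy and FuzzyWuzzy. The log compression, calibration rule, and confidence cutoff follow the description above; the function names, band count, and the upper band edge (capped at the Nyquist limit, 8 kHz for 16 kHz input) are assumptions:

```python
import numpy as np
from fuzzywuzzy import fuzz

SAMPLE_RATE = 16000  # 16 kHz PCM from the audio thread

def log_spectrum(pcm: np.ndarray, n_bands: int = 32) -> np.ndarray:
    """FFT magnitudes from 50 Hz up to Nyquist, log-compressed for the UI."""
    mags = np.abs(np.fft.rfft(pcm))
    freqs = np.fft.rfftfreq(len(pcm), d=1.0 / SAMPLE_RATE)
    band = mags[(freqs >= 50) & (freqs <= SAMPLE_RATE / 2)]
    log_mags = np.log1p(band)  # ln(1 + m) approximates loudness perception
    # Resample to a fixed number of bars for the frontend visualizer.
    return np.interp(np.linspace(0, len(log_mags) - 1, n_bands),
                     np.arange(len(log_mags)), log_mags)

def calibrate_silence(ambient: np.ndarray) -> float:
    """Dynamic VAD threshold: T_silence = max(100, A_ambient_max * 1.2)."""
    return max(100.0, 1.2 * float(np.max(np.abs(ambient))))

def is_activation(transcript: str, keyword: str = "activate") -> bool:
    """Fuzzy keyword spotting with a >85 confidence cutoff."""
    return fuzz.partial_ratio(keyword, transcript.lower()) > 85
```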
The vision module ensures that only authorized personnel can access the system's "Helpful" persona.
- Enrollment: Captures trusted faces and stores their 128-d encodings.
- Verification Logic: A live face is verified when the Euclidean distance between its encoding and a trusted encoding falls below 0.6:
  $$\lVert E_{\text{live}} - E_{\text{trusted}} \rVert < 0.6$$
- Anti-Spoofing/Stability: The implementation includes a temporal buffer to prevent flickering decisions due to motion blur or lighting changes (a brief sketch follows).
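A brief sketch of this check, treating the 128-d dlib encodings as plain NumPy vectors; the buffer length and majority-vote rule are illustrative choices for the temporal smoothing, not necessarily the project's exact ones:

```python
from collections import deque
import numpy as np

MATCH_THRESHOLD = 0.6     # standard dlib face-encoding distance cutoff
votes = deque(maxlen=10)  # temporal buffer to smooth flickering decisions

def is_match(live_encoding, trusted_encodings):
    """True if the live 128-d encoding is within 0.6 of any trusted encoding."""
    return any(np.linalg.norm(live_encoding - t) < MATCH_THRESHOLD
               for t in trusted_encodings)

def update_verified(live_encoding, trusted_encodings):
    """Flip the is_verified flag only when a majority of recent frames agree."""
    votes.append(is_match(live_encoding, trusted_encodings))
    return sum(votes) > len(votes) // 2
```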
The conversational core utilizes Google's Gemma model (`google/gemma-3n-E4b-it`), optimized for consumer hardware using `torch.bfloat16` precision.
The agent's behavior is dictated by dynamic prompt injection (a minimal sketch follows the two prompts):
- Verified State: "You are a helpful security assistant. Answer the user's queries concisely."
- Unverified State (Guard Mode): "Instructions: You are a guard AI. An unverified user is trying to talk to you. Politely but firmly, tell them to please kindly leave the room."
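The persona switch reduces to selecting the instruction block from the `is_verified` flag before each generation. A minimal sketch, using a plain prompt string (the actual code may render these through the tokenizer's chat template instead):

```python
ASSISTANT_PROMPT = ("You are a helpful security assistant. "
                    "Answer the user's queries concisely.")
GUARD_PROMPT = ("Instructions: You are a guard AI. An unverified user is "
                "trying to talk to you. Politely but firmly, tell them to "
                "please kindly leave the room.")

def build_prompt(user_text: str, is_verified: bool) -> str:
    # Dynamic prompt injection: the instructions change with the security state.
    system = ASSISTANT_PROMPT if is_verified else GUARD_PROMPT
    return f"{system}\n\nUser: {user_text}\nAssistant:"
```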
- Python 3.8+
- Node.js & npm (Required for building the frontend)
- CUDA-enabled GPU (Recommended for Gemma LLM)
- Webcam and Microphone
- Clone the repository:
  ```bash
  git clone https://github.com/amanmoon/AI_Guard_Agent.git
  cd AI_Guard_Agent
  ```
- Install Python dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Install Node dependencies: Navigate to the root directory where `package.json` is located and install the frontend packages.
  ```bash
  npm install
  ```
- Build the Frontend: This compiles the React/TypeScript code into static files that Eel can serve.
  ```bash
  npm run build
  ```
- Run the Application: Once the build is complete, start the Python backend.
  ```bash
  python eelApplication.py
  ```
- Idle Mode: The system listens for the specific "Activate" keyword.
- Active Mode:
  - The Vision Thread continuously checks for faces.
  - The Audio Thread captures user speech.
  - Whisper transcribes speech -> text.
  - The LLM checks the `is_verified` flag -> generates a response.
  - TTS speaks the response back to the user (condensed in the sketch below).
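Condensing the transcription-to-speech path into code, a sketch assuming the openai-whisper and pyttsx3 APIs; the model size and the stubbed LLM reply are placeholders, not the project's actual values:

```python
import whisper  # openai-whisper
import pyttsx3

asr = whisper.load_model("base")  # model size is an assumption
tts = pyttsx3.init()              # offline TTS engine

def handle_utterance(wav_path: str, is_verified: bool) -> None:
    # Whisper transcribes speech -> text.
    text = asr.transcribe(wav_path)["text"]
    # The Gemma LLM would generate a persona-aware reply here; stubbed for brevity.
    reply = "How can I help?" if is_verified else "Please leave the room."
    # pyttsx3 speaks the response back to the user.
    tts.say(reply)
    tts.runAndWait()
```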
© 2025 AI Guard Agent Project | IIT Bombay