Edge Multi-Modal RAG Tool for Machine Troubleshooting

Multimodal RAG on Edge is a tool for performing multi-modal searches within files, accepting an image, text, or a combination of both as the input query. It uses the multi-modal vector search engine Marqo for retrieval and LLaVA, a Large Multimodal Model co-developed with Microsoft Research, to generate a readable response based on the search results.

The tool can help with on-site machine troubleshooting, among other use cases. In this sample, we demonstrate how to use the tool as an industry copilot that helps people on a construction site quickly troubleshoot excavators. When an excavator breaks down during construction operations, it is usually time-consuming for on-site staff to find a solution in the technical manual and past troubleshooting logs, or to contact a technical specialist for help. The edge multi-modal RAG tool serves as an industry copilot that helps operators fix such problems quickly.

The solution is independent of cloud services; both the vector search engine and the LMM can be deployed to an edge device with either a CPU or a GPU.

This solution supports multi-modal queries combining images and text. The text-only version of the Edge RAG solution is available here: azure-edge-extensions-retrieval-augmented-generation.

Architecture

A multimodal RAG solution typically comprises two processes: Indexing and Searching/Generation.

  • Indexing is the process of creating a vector representation of the data.

[Figure: indexing architecture]

The dataset format we use is a CSV file containing image and text content. Both the image and the text content are embedded into the multi-modal vector database, a multi-modal vector search engine that stores the vectors and performs multi-modal vector search against an input query of image and text. The multimodal dataset structure and a typical multimodal index item are shown below (a code sketch of the indexing flow follows the Searching/Generation description):

[Figure: multimodal dataset structure and a typical index item]

  • Searching/Generation is the process of finding the most similar vectors to a given query vector and then generating a response based on the query and the search results.

[Figure: searching/generation architecture]
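
To make the indexing process concrete, here is a minimal sketch of how one CSV row (a text description plus an image URL) could be embedded into Marqo. It assumes a recent Marqo Python client running against the local server from the Quick Start; the index name, field names, model choice, and URL are hypothetical, and the exact keyword arguments vary between Marqo versions.

    import marqo

    # Connect to the Marqo server running on the local machine (see Quick Start step 1).
    mq = marqo.Client(url="http://localhost:8882")

    # Create an index backed by a CLIP-style model so that image URLs and text
    # are embedded into the same vector space.
    mq.create_index(
        "troubleshooting-index",                       # hypothetical index name
        model="open_clip/ViT-B-32/laion2b_s34b_b79k",  # example multimodal model
        treat_urls_and_pointers_as_images=True,
    )

    # One row of the multimodal CSV dataset (hypothetical values).
    doc = {
        "_id": "row-0001",
        "issue_text": "Hydraulic arm moves slowly under load.",
        "image_url": "https://example.com/images/excavator_arm.jpg",
    }

    # Embed both fields into the multi-modal vector index.
    mq.index("troubleshooting-index").add_documents(
        [doc],
        tensor_fields=["issue_text", "image_url"],
    )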

The Edge Multimodal RAG tool is composed of four components, accessible via a web UI application:

  • create_index: to create a new index in the multi-modal vector database.
  • delete_index: to delete an existing index from the multi-modal vector database.
  • upload_data: to upload a document containing image and text content to the multi-modal vector database; the document content is embedded into the vector database.
  • search_and_generate: to perform a multi-modal vector search based on a multi-modal input query of image and text, then generate a response from the search result. We currently use Marqo as the multi-modal vector search engine and LLaVA as the Large Multimodal Model (LMM) that generates the response. A minimal sketch of this flow is shown after this list.
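
The sketch below illustrates the search_and_generate flow under the same assumptions as the indexing sketch above: a weighted multi-modal Marqo query retrieves the closest document, and the retrieved text is passed as context to the locally compiled llava-cli binary. The field names, weights, URL, and prompt are hypothetical; the repository's actual implementation may compose the command differently.

    import subprocess
    import marqo

    mq = marqo.Client(url="http://localhost:8882")

    # Weighted multi-modal query: the text and the image URL each contribute to the
    # query vector according to the weights chosen in the web UI.
    query = {
        "hydraulic arm moves slowly under load": 0.7,          # text weight
        "https://example.com/images/excavator_arm.jpg": 0.3,   # image weight (hypothetical URL)
    }
    results = mq.index("troubleshooting-index").search(q=query, limit=1)
    context = results["hits"][0]["issue_text"] if results["hits"] else ""

    # Hand the retrieved context and the query image to the llava-cli executable
    # (paths follow the configuration shown in Quick Start step 5).
    prompt = f"Troubleshooting note: {context}\nExplain how to fix the issue shown in the image."
    subprocess.run([
        "../llava/llava-cli",
        "-m", "../llava/models/ggml-model-q4_k.gguf",
        "--mmproj", "../llava/models/mmproj-model-f16.gguf",
        "--image", "excavator_arm.jpg",   # local copy of the query image
        "-p", prompt,
    ])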

Getting Started

Prerequisites

  • An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
  • A Linux machine. The sample is tested on WSL2 Ubuntu 20.04 LTS.

Installation

  • Install Docker Engine on the Linux machine by following the guide here

  • Install make and g++, which are needed to compile the LLaVA C++ executable

sudo apt-get update
sudo apt-get install make
sudo apt-get install g++
  • Create a virtual environment. Make sure Anaconda is installed first.
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip 

Quick Start

  1. Pull the Docker image and run the Marqo server on your local dev machine

    docker rm -f marqo
    docker pull marqoai/marqo:latest
    docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
  2. Clone the llama.cpp GitHub repo and follow its instructions to compile the LLaVA executable llava-cli into the local path ./llava

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make llava-cli
  3. Download the LLaVA model from here into your local path ./llama.cpp/models. You need to download both of these model files:

    • mmproj-model-f16.gguf
    • ggml-model-q4_k.gguf
  4. Clone this repo to your local dev machine:

    git clone <repo url>
    cd azure-edge-extensions-retrieval-augmented-generation-multimodel/
    pip install -r requirements.txt
  5. Configure the parameters below with your LLaVA model paths in page_search_and_generate.py

    LLAVA_EXEC_PATH = "../llava/llava-cli "
    MODEL_PATH = "../llava/models/ggml-model-q4_k.gguf"
    MMPROJ_PATH = "../llava/models/mmproj-model-f16.gguf"

  6. Run the web UI server

    cd src/
    streamlit run page_edge_multimodal_rag.py

    The browser will open the web UI page automatically. If not, open a browser and go to http://localhost:8501.

  7. Create an Azure Blob Storage account and upload your multimodal documents to the blob storage. Follow the instructions here.

    For demo purposes, use ./data/demo_dataset.csv as the document to upload. The dataset contains machine troubleshooting guidance and image URLs.

    Remember to update the blob storage URL in the file ./page_upload_data.py (a minimal download sketch appears after this list).

    account_url = ""
    sas_token = ""  # Replace with the SAS token from your URL
    container_name = ""
    blob_name = ""
    local_file_path = "" # your local file path for the downloaded multimodal file
  8. Use the web UI to perform the following operations:

    • page-create-index: Input a new index name and create a new index in the multi-modal vector database.
    • page-delete-index: Select an index name and delete it from the multi-modal vector database.
    • page-upload-data: The demo code automatically downloads the dataset from Azure Blob Storage. The dataset contains image and text content to be embedded into the multi-modal vector database.
    • page-search-and-generate: Input your text query and image URL, along with a weight for each, then click Search. The web app sends the query to the backend and returns the generated response.

    The response time depends on the edge machine's specs and computing power. On an 8-core/32 GB RAM CPU machine, the vector search takes seconds and LLaVA generation takes a few minutes. Choose a larger machine size to speed up the response.
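
For step 7, here is a minimal sketch of how the demo dataset could be downloaded from Blob Storage with a SAS token, assuming the azure-storage-blob package. The placeholder values correspond to the variables in ./page_upload_data.py; the actual script may differ.

    from azure.storage.blob import BlobServiceClient

    account_url = "https://<account>.blob.core.windows.net"  # placeholder account URL
    sas_token = "<sas-token>"                                # placeholder SAS token
    container_name = "<container>"                           # placeholder container name
    blob_name = "demo_dataset.csv"                           # blob uploaded in step 7
    local_file_path = "./data/demo_dataset.csv"              # local destination

    # Authenticate with the SAS token and download the multimodal CSV file.
    service = BlobServiceClient(account_url=account_url, credential=sas_token)
    blob_client = service.get_blob_client(container=container_name, blob=blob_name)
    with open(local_file_path, "wb") as f:
        f.write(blob_client.download_blob().readall())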

Demo

The original demo video can be found here.

mm-rag-edge-10mb.mp4
