Advanced Multi-Source RAG Chatbot

Welcome to the Advanced Multi-Source RAG Chatbot project! This is a Retrieval-Augmented Generation (RAG) based chatbot that leverages multiple data sources—PDFs, YouTube transcripts, web searches, and Wikipedia—to provide accurate and context-rich responses. Built with state-of-the-art tools like Mistral-7B, LangChain, and Gradio, this project offers both a command-line interface and a web-based UI for seamless interaction.

Table of Contents

  1. Project Overview
  2. Features
  3. Directory Structure
  4. Prerequisites
  5. Setup Instructions
  6. Running the Project
  7. Usage Examples
  8. Screenshots
  9. Project Components
  10. Contributing
  11. License

Project Overview

The Advanced Multi-Source RAG Chatbot is designed to answer queries by retrieving information from multiple sources and generating concise responses using a quantized Mistral-7B model. It integrates document search (PDFs), YouTube transcript extraction, web search (via Tavily), and Wikipedia summaries, making it a versatile tool for research, education, and general knowledge exploration.
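At a high level, each query flows through two stages: context is first retrieved from whichever sources are available, and the quantized LLM then generates an answer grounded in that context. The sketch below illustrates this flow; the function bodies are hypothetical stand-ins for the actual helpers in retrieval.py and generation.py.

def retrieve_context(query, vectorstore=None, youtube_url=None):
    # Hypothetical stand-in: the real retrieval.py gathers text from the
    # PDF index, Tavily web search, Wikipedia, and YouTube as available.
    return ["...retrieved context chunks..."]

def generate_response(query, context):
    # Hypothetical stand-in: the real generation.py prompts the quantized
    # Mistral-7B model with the query plus the retrieved context.
    return f"Answer to {query!r} grounded in {len(context)} context chunk(s)"

def answer_query(query, vectorstore=None, youtube_url=None):
    context = retrieve_context(query, vectorstore=vectorstore, youtube_url=youtube_url)
    return generate_response(query, context)

print(answer_query("What is Low Rank Adaptation?"))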


Features

  • Multi-Source Retrieval: Extracts information from uploaded PDFs, YouTube video descriptions, web searches, and Wikipedia.
  • Quantized LLM: Uses a 4-bit quantized Mistral-7B model for efficient inference on GPUs (see the loading sketch after this list).
  • Conversation Memory: Maintains chat history for context-aware responses.
  • Web Interface: Built with Gradio for an intuitive user experience.
  • Modular Design: Organized into separate Python files for easy maintenance and scalability.
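Loading a 7B-parameter model in 4-bit precision is what makes single-GPU inference practical. The following is a minimal sketch of such a load using Transformers and bitsandbytes; the exact configuration and checkpoint name in model_config.py may differ.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with half-precision compute keeps the 7B model
# within a single consumer GPU's memory budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint; see model_config.py
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)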

Directory Structure

Advanced_Multi_Source_RAG_Chatbot/
├── requirements.txt              # Project dependencies
├── main.py                       # Console-based inference script
├── model_config.py               # Model loading and configuration
├── data_processing.py            # PDF loading and vector store creation
├── retrieval.py                  # Multi-source retrieval functions
├── generation.py                 # Response generation and memory management
├── app.py                        # Gradio web interface
├── images/                       # Example screenshots
│   ├── pdf_example.png           # Screenshot of PDF-based response
│   ├── text_example.png          # Screenshot of text query response
│   └── youtube_example.png       # Screenshot of YouTube-based response
├── Advanced_Multi_Source_RAG_Chatbot.ipynb  # Colab notebook with full execution
├── LICENSE                       # MIT License
└── README.md                     # Project documentation

Prerequisites

  • Python: Version 3.8 or higher
  • GPU: Recommended for faster inference (Mistral-7B runs on CUDA)
  • API Tokens: Hugging Face token and Tavily API key
  • YouTube Cookies: Required for transcript retrieval
  • Internet Connection: For web search and YouTube/Wikipedia retrieval

Setup Instructions

Install Dependencies

  1. Clone the repository:
    git clone https://github.com/HimadeepRagiri/Advanced_Multi_Source_RAG_Chatbot.git
    cd Advanced_Multi_Source_RAG_Chatbot
  2. Install the required packages:
    pip install -r requirements.txt

Obtain API Tokens

This project requires two API tokens to function fully:

Hugging Face Token

  1. Sign up or log in to Hugging Face.
  2. Go to your profile > Settings > Access Tokens.
  3. Generate a new token (e.g., hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx).
  4. Add the token to your environment or script:
    from huggingface_hub import login
    token = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # Replace with your token
    login(token)

Tavily API Key

  1. Sign up at Tavily.
  2. Get your API key from the dashboard (e.g., tvly-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx).
  3. Set it as an environment variable:
    import os
    os.environ["TAVILY_API_KEY"] = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # Replace with your key
    Alternatively, export it in your terminal:
    export TAVILY_API_KEY="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Important: Replace the example tokens with your own. Do not share your tokens publicly!

Generate YouTube Cookies

To retrieve YouTube video descriptions (referred to as transcripts in this project), you need a cookies.txt file:

  1. Install a Browser Extension:
    • Use a Chrome/Firefox extension such as "Get cookies.txt".
  2. Export Cookies:
    • Visit YouTube, log in (optional), and export cookies using the extension.
    • Save the file as cookies.txt.
  3. Place the File:
    • Move cookies.txt to the root directory of the project (Advanced_Multi_Source_RAG_Chatbot/).
    • The retrieval script expects it at /content/cookies.txt in Colab; adjust the path in retrieval.py if running locally (e.g., ./cookies.txt).

Note: Without cookies.txt, YouTube retrieval will fail, either silently or with an explicit error.
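For reference, this is roughly how yt-dlp uses the cookies file to fetch a video's metadata without downloading it; retrieval.py may structure the call differently.

import yt_dlp

def fetch_description(url, cookie_path="cookies.txt"):
    # skip_download: we only want the metadata, not the video itself.
    opts = {"cookiefile": cookie_path, "skip_download": True, "quiet": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return info.get("description", "")

# Example (replace with a real video URL):
# print(fetch_description("https://www.youtube.com/watch?v=example"))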


Running the Project

Console Mode

Run the console-based inference:

python main.py
  • This executes a sample query: "What is Low Rank Adaptation in the context of machine learning models?"
  • Outputs the retrieved sources, the generated response, and the conversation memory.

Web Interface

Launch the Gradio web UI:

python app.py
  • Open the provided URL (e.g., http://127.0.0.1:7860) in your browser.
  • Input a query, optionally upload a PDF or provide a YouTube URL, and click "Generate Response".
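For a sense of how such an interface is wired up, here is a stripped-down Gradio sketch; the actual layout in app.py is richer, and respond is a hypothetical placeholder for the project's response function.

import gradio as gr

def respond(query, pdf_file, youtube_url):
    # Hypothetical placeholder: the real function retrieves context from the
    # uploaded PDF and/or YouTube URL and queries the LLM.
    return f"Answer to: {query}"

with gr.Blocks() as demo:
    query = gr.Textbox(label="Query")
    pdf_file = gr.File(label="Upload PDF (optional)")
    youtube_url = gr.Textbox(label="YouTube URL (optional)")
    output = gr.Textbox(label="Response")
    gr.Button("Generate Response").click(
        respond, inputs=[query, pdf_file, youtube_url], outputs=output
    )

demo.launch()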

Colab Notebook

  1. Open Advanced_Multi_Source_RAG_Chatbot.ipynb in Google Colab.
  2. Upload cookies.txt to the Colab environment:
    from google.colab import files
    files.upload()  # Upload cookies.txt
  3. Run all cells to execute the full project, including the Gradio interface.

Usage Examples

  1. Text Query: "What is Low Rank Adaptation in machine learning?"
    • Enter the query in the text box and click "Generate Response".
  2. PDF Upload: Upload a PDF about machine learning, then ask a related question.
  3. YouTube URL: Provide a URL (e.g., https://www.youtube.com/watch?v=example) to extract its description for the response.

See Screenshots for visual examples.


Screenshots

Here are example outputs from the Gradio interface:

  1. PDF-Based Response (images/pdf_example.png)
  2. Text Query Response (images/text_example.png)
  3. YouTube-Based Response (images/youtube_example.png)


Project Components

Files

  • main.py: Console-based inference script with a sample query.
  • model_config.py: Loads the Mistral-7B model, tokenizer, embeddings, and Tavily search tool.
  • data_processing.py: Handles PDF loading and FAISS vector store creation (see the indexing sketch after this list).
  • retrieval.py: Retrieves data from PDFs, web, Wikipedia, and YouTube.
  • generation.py: Generates responses using the LLM and manages conversation memory.
  • app.py: Implements the Gradio web interface.
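The PDF indexing step in data_processing.py typically follows the pattern below; the module paths match recent langchain-community releases, and the chunk sizes and embedding model name are assumptions.

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF and split it into overlapping chunks for retrieval.
docs = PyPDFLoader("example.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# Embed the chunks and index them in FAISS (embedding model is an assumption).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# At query time, fetch the most relevant chunks.
relevant = vectorstore.similarity_search("What is Low Rank Adaptation?", k=3)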

Technologies

  • Mistral-7B: 4-bit quantized LLM for response generation.
  • LangChain: For embeddings, vector stores, and memory management (see the memory sketch after this list).
  • FAISS: Vector store for PDF document search.
  • Tavily: Web search API.
  • Gradio: Web UI framework.
  • yt-dlp: YouTube transcript extraction.
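To illustrate the memory component, LangChain's ConversationBufferMemory records prior turns so they can be folded into the next prompt; generation.py may manage its history differently.

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()

# After each turn, record the exchange so later prompts can include it.
memory.save_context(
    {"input": "What is Low Rank Adaptation?"},
    {"output": "Low-Rank Adaptation (LoRA) fine-tunes..."},
)

# Before generating, pull the accumulated history into the prompt.
history = memory.load_memory_variables({})["history"]
print(history)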

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/your-feature).
  3. Commit your changes (git commit -m "Add your feature").
  4. Push to the branch (git push origin feature/your-feature).
  5. Open a Pull Request.

Please ensure your code follows the existing structure and includes appropriate documentation.


License

This project is licensed under the MIT License. See the LICENSE file for details.

