# Advanced Multi-Source RAG Chatbot

Welcome to the Advanced Multi-Source RAG Chatbot project! This is a Retrieval-Augmented Generation (RAG) based chatbot that leverages multiple data sources—PDFs, YouTube transcripts, web searches, and Wikipedia—to provide accurate and context-rich responses. Built with state-of-the-art tools like Mistral-7B, LangChain, and Gradio, this project offers both a command-line interface and a web-based UI for seamless interaction.

## Table of Contents
- Project Overview
- Features
- Directory Structure
- Prerequisites
- Setup Instructions
- Running the Project
- Usage Examples
- Screenshots
- Project Components
- Contributing
- License
## Project Overview

The Advanced Multi-Source RAG Chatbot is designed to answer queries by retrieving information from multiple sources and generating concise responses using a quantized Mistral-7B model. It integrates document search (PDFs), YouTube transcript extraction, web search (via Tavily), and Wikipedia summaries, making it a versatile tool for research, education, and general knowledge exploration.
## Features

- Multi-Source Retrieval: Extracts information from uploaded PDFs, YouTube video descriptions, web searches, and Wikipedia.
- Quantized LLM: Uses a 4-bit quantized Mistral-7B model for efficient inference on GPUs (see the loading sketch after this list).
- Conversation Memory: Maintains chat history for context-aware responses.
- Web Interface: Built with Gradio for an intuitive user experience.
- Modular Design: Organized into separate Python files for easy maintenance and scalability.
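As referenced in the features list, here is a minimal sketch of loading a 4-bit quantized Mistral-7B with `transformers` and `bitsandbytes`. The checkpoint name and quantization settings are assumptions; `model_config.py` may use different values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical sketch: checkpoint and settings are assumptions,
# not read from model_config.py.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)
```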
## Directory Structure

```
Advanced_Multi_Source_RAG_Chatbot/
├── requirements.txt # Project dependencies
├── main.py # Console-based inference script
├── model_config.py # Model loading and configuration
├── data_processing.py # PDF loading and vector store creation
├── retrieval.py # Multi-source retrieval functions
├── generation.py # Response generation and memory management
├── app.py # Gradio web interface
├── images/ # Example screenshots
│ ├── pdf_example.png # Screenshot of PDF-based response
│ ├── text_example.png # Screenshot of text query response
│ └── youtube_example.png # Screenshot of YouTube-based response
├── Advanced_Multi_Source_RAG_Chatbot.ipynb # Colab notebook with full execution
├── LICENSE # MIT License
└── README.md # Project documentation
```
## Prerequisites

- Python: Version 3.8 or higher
- GPU: Recommended for faster inference (Mistral-7B runs on CUDA)
- API Tokens: Hugging Face token and Tavily API key
- YouTube Cookies: Required for transcript retrieval
- Internet Connection: For web search and YouTube/Wikipedia retrieval
## Setup Instructions

- Clone the repository:

  ```bash
  git clone https://github.com/HimadeepRagiri/Advanced_Multi_Source_RAG_Chatbot.git
  cd Advanced_Multi_Source_RAG_Chatbot
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
### API Tokens

This project requires two API tokens to function fully:
- Hugging Face token:
  - Sign up or log in to Hugging Face.
  - Go to your profile > Settings > Access Tokens.
  - Generate a new token (e.g., `hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`).
  - Add the token to your environment or script:

    ```python
    from huggingface_hub import login

    token = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # Replace with your token
    login(token)
    ```
- Tavily API key:
  - Sign up at Tavily.
  - Get your API key from the dashboard (e.g., `tvly-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`).
  - Set it as an environment variable in your script:

    ```python
    import os

    os.environ["TAVILY_API_KEY"] = "tvly-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # Replace with your key
    ```

  - Alternatively, export it in your terminal:

    ```bash
    export TAVILY_API_KEY="tvly-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    ```
Important: Replace the example tokens with your own. Do not share your tokens publicly!
### YouTube Cookies

To retrieve YouTube transcripts (descriptions), you need a `cookies.txt` file:

- Install a browser extension: use a Chrome/Firefox extension such as "Get cookies.txt".
- Export cookies: visit YouTube, log in (optional), and export cookies using the extension. Save the file as `cookies.txt`.
- Place the file: move `cookies.txt` to the root directory of the project (`Advanced_Multi_Source_RAG_Chatbot/`). The retrieval script expects it at `/content/cookies.txt` in Colab; adjust the path in `retrieval.py` if running locally (e.g., `./cookies.txt`).

Note: Without `cookies.txt`, YouTube transcript retrieval will fail silently or return an error.
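As a rough illustration, here is how `yt-dlp` can pull a video's description using that cookies file; the options and URL below are placeholders, and `retrieval.py` may structure this differently.

```python
import yt_dlp

# Sketch only: fetch a video's description with the exported cookies file.
# Adjust the path (/content/cookies.txt in Colab, ./cookies.txt locally).
ydl_opts = {
    "cookiefile": "./cookies.txt",
    "skip_download": True,  # metadata only, no video download
    "quiet": True,
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info("https://www.youtube.com/watch?v=example", download=False)
    print(info.get("description", ""))
```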
## Running the Project

### Console Mode

Run the console-based inference:

```bash
python main.py
```

- This executes a sample query: "What is Low Rank Adaptation in the context of machine learning models?"
- It outputs the retrieved sources, the generated response, and the conversation memory (a minimal memory sketch follows below).
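The Project Components section notes that `generation.py` manages conversation memory via LangChain. Assuming it uses something like `ConversationBufferMemory` (an assumption; the repo may use a different class), the stored history behaves roughly like this:

```python
from langchain.memory import ConversationBufferMemory

# Hypothetical sketch of the chat-history store; generation.py may differ.
memory = ConversationBufferMemory(return_messages=True)
memory.save_context(
    {"input": "What is Low Rank Adaptation?"},
    {"output": "LoRA adapts large models by training small low-rank matrices."},
)
print(memory.load_memory_variables({}))
# -> {'history': [HumanMessage(...), AIMessage(...)]}
```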
### Web Interface

Launch the Gradio web UI:

```bash
python app.py
```

- Open the provided URL (e.g., `http://127.0.0.1:7860`) in your browser.
- Input a query, optionally upload a PDF or provide a YouTube URL, and click "Generate Response".
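For readers unfamiliar with Gradio, here is a stripped-down sketch of such an interface; the real `app.py` additionally wires a PDF upload and a YouTube URL field into the RAG pipeline, so treat this as illustrative only.

```python
import gradio as gr

def respond(query: str) -> str:
    # Placeholder for the real pipeline (retrieval + Mistral-7B generation).
    return f"You asked: {query}"

with gr.Blocks() as demo:
    query = gr.Textbox(label="Query")
    output = gr.Textbox(label="Response")
    gr.Button("Generate Response").click(respond, inputs=query, outputs=output)

demo.launch()  # serves on http://127.0.0.1:7860 by default
```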
### Google Colab

- Open `Advanced_Multi_Source_RAG_Chatbot.ipynb` in Google Colab.
- Upload `cookies.txt` to the Colab environment:

  ```python
  from google.colab import files

  files.upload()  # Upload cookies.txt
  ```

- Run all cells to execute the full project, including the Gradio interface.
## Usage Examples

- Text Query: "What is Low Rank Adaptation in machine learning?"
  - Enter the query in the text box and click "Generate Response".
- PDF Upload: Upload a PDF about machine learning, then ask a related question.
- YouTube URL: Provide a URL (e.g., `https://www.youtube.com/watch?v=example`) to extract its description for the response.

See Screenshots for visual examples.
## Screenshots

Here are example outputs from the Gradio interface:

![Text query response](images/text_example.png)

![PDF-based response](images/pdf_example.png)

![YouTube-based response](images/youtube_example.png)
## Project Components

- `main.py`: Console-based inference script with a sample query.
- `model_config.py`: Loads the Mistral-7B model, tokenizer, embeddings, and Tavily search tool.
- `data_processing.py`: Handles PDF loading and FAISS vector store creation.
- `retrieval.py`: Retrieves data from PDFs, the web, Wikipedia, and YouTube.
- `generation.py`: Generates responses using the LLM and manages conversation memory.
- `app.py`: Implements the Gradio web interface.
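To make the `data_processing.py` role concrete, below is a minimal sketch of PDF-to-FAISS indexing with LangChain. The chunk sizes, embedding model, and package split (`langchain-community`, `langchain-huggingface`, `langchain-text-splitters`) are assumptions rather than details read from the repo.

```python
from langchain_community.document_loaders import PyPDFLoader  # requires pypdf
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def build_vector_store(pdf_path: str) -> FAISS:
    """Load a PDF, split it into chunks, and index the chunks in FAISS."""
    docs = PyPDFLoader(pdf_path).load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100
    ).split_documents(docs)
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"  # assumed model
    )
    return FAISS.from_documents(chunks, embeddings)

# Usage: store = build_vector_store("paper.pdf"); store.similarity_search("LoRA", k=3)
```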
Key technologies:

- Mistral-7B: 4-bit quantized LLM for response generation.
- LangChain: For embeddings, vector stores, and memory management.
- FAISS: Vector store for PDF document search.
- Tavily: Web search API.
- Gradio: Web UI framework.
- yt-dlp: YouTube transcript extraction.
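And here is a rough sketch of how the web-search and Wikipedia sources can be queried, using the `tavily-python` and `wikipedia` packages; `retrieval.py` may use LangChain's wrappers instead, so the function names below are illustrative.

```python
from typing import List

import wikipedia
from tavily import TavilyClient

def web_search(query: str, api_key: str) -> List[str]:
    """Return snippet strings from a Tavily web search."""
    client = TavilyClient(api_key=api_key)
    response = client.search(query, max_results=3)
    return [r["content"] for r in response.get("results", [])]

def wiki_summary(query: str) -> str:
    """Return a short Wikipedia summary, or an empty string if lookup fails."""
    try:
        return wikipedia.summary(query, sentences=3)
    except Exception:
        return ""
```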
## Contributing

Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch (`git checkout -b feature/your-feature`).
- Commit your changes (`git commit -m "Add your feature"`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a Pull Request.
Please ensure your code follows the existing structure and includes appropriate documentation.
## License

This project is licensed under the MIT License. See the LICENSE file for details.


