Advanced Multi-Source RAG Chatbot

Welcome to the Advanced Multi-Source RAG Chatbot project! This is a Retrieval-Augmented Generation (RAG) based chatbot that leverages multiple data sources—PDFs, YouTube transcripts, web searches, and Wikipedia—to provide accurate and context-rich responses. Built with state-of-the-art tools like Mistral-7B, LangChain, and Gradio, this project offers both a command-line interface and a web-based UI for seamless interaction.

Table of Contents

  1. Project Overview
  2. Features
  3. Directory Structure
  4. Prerequisites
  5. Setup Instructions
  6. Running the Project
  7. Usage Examples
  8. Screenshots
  9. Project Components
  10. Contributing
  11. License

Project Overview

The Advanced Multi-Source RAG Chatbot is designed to answer queries by retrieving information from multiple sources and generating concise responses using a quantized Mistral-7B model. It integrates document search (PDFs), YouTube transcript extraction, web search (via Tavily), and Wikipedia summaries, making it a versatile tool for research, education, and general knowledge exploration.
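At a high level, each query flows through two stages: context is first retrieved from whichever sources are available, and the quantized LLM then generates an answer grounded in that context. The sketch below illustrates this flow; the function bodies are hypothetical stand-ins for the actual helpers in retrieval.py and generation.py.

def retrieve_context(query, vectorstore=None, youtube_url=None):
    # Hypothetical stand-in: the real retrieval.py gathers text from the
    # PDF index, Tavily web search, Wikipedia, and YouTube as available.
    return ["...retrieved context chunks..."]

def generate_response(query, context):
    # Hypothetical stand-in: the real generation.py prompts the quantized
    # Mistral-7B model with the query plus the retrieved context.
    return f"Answer to {query!r} grounded in {len(context)} context chunk(s)"

def answer_query(query, vectorstore=None, youtube_url=None):
    context = retrieve_context(query, vectorstore=vectorstore, youtube_url=youtube_url)
    return generate_response(query, context)

print(answer_query("What is Low Rank Adaptation?"))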


Features

  • Multi-Source Retrieval: Extracts information from uploaded PDFs, YouTube video descriptions, web searches, and Wikipedia.
  • Quantized LLM: Uses a 4-bit quantized Mistral-7B model for efficient inference on GPUs (see the loading sketch after this list).
  • Conversation Memory: Maintains chat history for context-aware responses.
  • Web Interface: Built with Gradio for an intuitive user experience.
  • Modular Design: Organized into separate Python files for easy maintenance and scalability.
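Loading a 7B-parameter model in 4-bit precision is what makes single-GPU inference practical. The following is a minimal sketch of such a load using Transformers and bitsandbytes; the exact configuration and checkpoint name in model_config.py may differ.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with half-precision compute keeps the 7B model
# within a single consumer GPU's memory budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint; see model_config.py
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)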

Directory Structure

Advanced_Multi_Source_RAG_Chatbot/
├── requirements.txt              # Project dependencies
├── main.py                       # Console-based inference script
├── model_config.py               # Model loading and configuration
├── data_processing.py            # PDF loading and vector store creation
├── retrieval.py                  # Multi-source retrieval functions
├── generation.py                 # Response generation and memory management
├── app.py                        # Gradio web interface
├── images/                       # Example screenshots
│   ├── pdf_example.png           # Screenshot of PDF-based response
│   ├── text_example.png          # Screenshot of text query response
│   └── youtube_example.png       # Screenshot of YouTube-based response
├── Advanced_Multi_Source_RAG_Chatbot.ipynb  # Colab notebook with full execution
├── LICENSE                       # MIT License
└── README.md                     # Project documentation

Prerequisites

  • Python: Version 3.8 or higher
  • GPU: Recommended for faster inference (Mistral-7B runs on CUDA)
  • API Tokens: Hugging Face token and Tavily API key
  • YouTube Cookies: Required for transcript retrieval
  • Internet Connection: For web search and YouTube/Wikipedia retrieval

Setup Instructions

Install Dependencies

  1. Clone the repository:
    git clone https://github.com/HimadeepRagiri/Advanced_Multi_Source_RAG_Chatbot.git
    cd Advanced_Multi_Source_RAG_Chatbot
  2. Install the required packages:
    pip install -r requirements.txt

Obtain API Tokens

This project requires two API tokens to function fully:

Hugging Face Token

  1. Sign up or log in to Hugging Face.
  2. Go to your profile > Settings > Access Tokens.
  3. Generate a new token (e.g., hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx).
  4. Add the token to your environment or script:
    from huggingface_hub import login
    token = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # Replace with your token
    login(token)

Tavily API Key

  1. Sign up at Tavily.
  2. Get your API key from the dashboard (e.g., tvly-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx).
  3. Set it as an environment variable:
    import os
    os.environ["TAVILY_API_KEY"] = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # Replace with your key
    Alternatively, export it in your terminal:
    export TAVILY_API_KEY="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Important: Replace the example tokens with your own. Do not share your tokens publicly!

Generate YouTube Cookies

To retrieve YouTube video descriptions (referred to as transcripts in this project), you need a cookies.txt file:

  1. Install a Browser Extension:
    • Use a Chrome/Firefox extension such as "Get cookies.txt".
  2. Export Cookies:
    • Visit YouTube, log in (optional), and export cookies using the extension.
    • Save the file as cookies.txt.
  3. Place the File:
    • Move cookies.txt to the root directory of the project (Advanced_Multi_Source_RAG_Chatbot/).
    • The retrieval script expects it at /content/cookies.txt in Colab; adjust the path in retrieval.py if running locally (e.g., ./cookies.txt).

Note: Without cookies.txt, YouTube retrieval will fail, either silently or with an explicit error.
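For reference, this is roughly how yt-dlp uses the cookies file to fetch a video's metadata without downloading it; retrieval.py may structure the call differently.

import yt_dlp

def fetch_description(url, cookie_path="cookies.txt"):
    # skip_download: we only want the metadata, not the video itself.
    opts = {"cookiefile": cookie_path, "skip_download": True, "quiet": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return info.get("description", "")

# Example (replace with a real video URL):
# print(fetch_description("https://www.youtube.com/watch?v=example"))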


Running the Project

Console Mode

Run the console-based inference:

python main.py
  • This executes a sample query: "What is Low Rank Adaptation in the context of machine learning models?"
  • Outputs the retrieved sources, the generated response, and the conversation memory.

Web Interface

Launch the Gradio web UI:

python app.py
  • Open the provided URL (e.g., http://127.0.0.1:7860) in your browser.
  • Input a query, optionally upload a PDF or provide a YouTube URL, and click "Generate Response".
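For a sense of how such an interface is wired up, here is a stripped-down Gradio sketch; the actual layout in app.py is richer, and respond is a hypothetical placeholder for the project's response function.

import gradio as gr

def respond(query, pdf_file, youtube_url):
    # Hypothetical placeholder: the real function retrieves context from the
    # uploaded PDF and/or YouTube URL and queries the LLM.
    return f"Answer to: {query}"

with gr.Blocks() as demo:
    query = gr.Textbox(label="Query")
    pdf_file = gr.File(label="Upload PDF (optional)")
    youtube_url = gr.Textbox(label="YouTube URL (optional)")
    output = gr.Textbox(label="Response")
    gr.Button("Generate Response").click(
        respond, inputs=[query, pdf_file, youtube_url], outputs=output
    )

demo.launch()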

Colab Notebook

  1. Open Advanced_Multi_Source_RAG_Chatbot.ipynb in Google Colab.
  2. Upload cookies.txt to the Colab environment:
    from google.colab import files
    files.upload()  # Upload cookies.txt
  3. Run all cells to execute the full project, including the Gradio interface.

Usage Examples

  1. Text Query: "What is Low Rank Adaptation in machine learning?"
    • Enter the query in the text box and click "Generate Response".
  2. PDF Upload: Upload a PDF about machine learning, then ask a related question.
  3. YouTube URL: Provide a URL (e.g., https://www.youtube.com/watch?v=example) to extract its description for the response.

See Screenshots for visual examples.


Screenshots

Here are example outputs from the Gradio interface:

  1. PDF-Based Response (images/pdf_example.png)
  2. Text Query Response (images/text_example.png)
  3. YouTube-Based Response (images/youtube_example.png)


Project Components

Files

  • main.py: Console-based inference script with a sample query.
  • model_config.py: Loads the Mistral-7B model, tokenizer, embeddings, and Tavily search tool.
  • data_processing.py: Handles PDF loading and FAISS vector store creation (see the indexing sketch after this list).
  • retrieval.py: Retrieves data from PDFs, web, Wikipedia, and YouTube.
  • generation.py: Generates responses using the LLM and manages conversation memory.
  • app.py: Implements the Gradio web interface.
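The PDF indexing step in data_processing.py typically follows the pattern below; the module paths match recent langchain-community releases, and the chunk sizes and embedding model name are assumptions.

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF and split it into overlapping chunks for retrieval.
docs = PyPDFLoader("example.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# Embed the chunks and index them in FAISS (embedding model is an assumption).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# At query time, fetch the most relevant chunks.
relevant = vectorstore.similarity_search("What is Low Rank Adaptation?", k=3)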

Technologies

  • Mistral-7B: 4-bit quantized LLM for response generation.
  • LangChain: For embeddings, vector stores, and memory management (see the memory sketch after this list).
  • FAISS: Vector store for PDF document search.
  • Tavily: Web search API.
  • Gradio: Web UI framework.
  • yt-dlp: YouTube transcript extraction.
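To illustrate the memory component, LangChain's ConversationBufferMemory records prior turns so they can be folded into the next prompt; generation.py may manage its history differently.

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()

# After each turn, record the exchange so later prompts can include it.
memory.save_context(
    {"input": "What is Low Rank Adaptation?"},
    {"output": "Low-Rank Adaptation (LoRA) fine-tunes..."},
)

# Before generating, pull the accumulated history into the prompt.
history = memory.load_memory_variables({})["history"]
print(history)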

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/your-feature).
  3. Commit your changes (git commit -m "Add your feature").
  4. Push to the branch (git push origin feature/your-feature).
  5. Open a Pull Request.

Please ensure your code follows the existing structure and includes appropriate documentation.


License

This project is licensed under the MIT License. See the LICENSE file for details.

