A Discord bot that automatically scrapes product information from URLs posted in a channel, uses AI vision to extract structured data from screenshots, and populates a Google Sheet in real time.
The bot is built on a modular, asynchronous architecture. When a URL is posted in Discord, it triggers the following pipeline:
graph TD
A[Discord Channel] -- message with URL --> B(DiscordReader);
B -- puts WorkItem on --> C{asyncio.Queue};
C -- consumed by --> D[ProcessingWorker];
D -- uses --> E[SeleniumFetcher];
E -- navigates to URL & takes screenshot --> F(Screenshot Image);
F -- path sent to --> G[GeminiImageParser];
G -- sends image to AI --> H{Google AI API};
H -- returns structured JSON --> G;
G -- returns AiProductInfo --> D;
D -- enriches data & uses --> I[GoogleSheetWriter];
I -- appends row to --> J[Google Sheet];
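The queue hand-off in the diagram can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the WorkItem fields, the process stand-in, and the worker count are all assumptions, and the real stages call Selenium, Gemini, and the Sheets API instead of returning a dictionary.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class WorkItem:
    """Hypothetical payload; the real WorkItem fields are not shown here."""
    url: str
    author: str


async def process(item: WorkItem) -> dict:
    # Stand-in for the fetch -> parse -> write stages of the pipeline.
    await asyncio.sleep(0)
    return {"url": item.url, "posted_by": item.author}


async def worker(queue: asyncio.Queue, results: list) -> None:
    # Consume items until the task is cancelled by run_pipeline.
    while True:
        item = await queue.get()
        try:
            results.append(await process(item))
        finally:
            queue.task_done()


async def run_pipeline(items: list[WorkItem], n_workers: int = 2) -> list[dict]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[dict] = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    for item in items:
        queue.put_nowait(item)
    await queue.join()  # wait until every queued item has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Because the reader only enqueues work, a slow page load in one worker does not block Discord message handling or the other workers.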
- Automated URL Detection: Monitors Discord channels for any valid URLs.
- Robust Web Scraping: Uses a stealth-configured Selenium browser to render JavaScript-heavy pages and take screenshots, bypassing many anti-bot measures.
- Intelligent Data Extraction: Leverages a multimodal AI model (Google's Gemini) to parse screenshots and extract key product details, with CAPTCHA detection and a headed-mode fallback.
- Structured Data Output: Enriches AI data with user context and writes it to a Google Sheet in a clean, organized format.
- Concurrent Processing: Built on asyncio to handle multiple requests simultaneously without blocking.
- Production-Ready: Runs as a resilient systemd service on Linux for auto-restarts and reliable background operation.
- CI/CD Enabled: Features a GitHub Actions workflow for automated, hands-off deployments to a self-hosted runner.
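As an illustration of the URL detection feature, the sketch below pulls http(s) links out of a message body. The pattern and the extract_urls helper are assumptions made for this example; the bot's actual matching rules are not documented here and may be stricter.

```python
import re
from urllib.parse import urlparse

# Simple pattern: an http(s) token up to whitespace or angle brackets
# (Discord wraps links in <...> to suppress embeds).
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def extract_urls(message: str) -> list[str]:
    """Return the http(s) URLs in a message that have a host component."""
    urls = []
    for match in URL_PATTERN.findall(message):
        candidate = match.rstrip(".,!?)")  # drop trailing sentence punctuation
        if urlparse(candidate).netloc:
            urls.append(candidate)
    return urls
```

Stripping trailing punctuation is a heuristic: it keeps links that end a sentence usable, at the cost of mangling the rare URL that really ends in one of those characters.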
- Bot Framework: discord.py
- Web Automation: Selenium
- AI / Data Extraction: google-generativeai (Gemini Pro Vision)
- Data Validation: Pydantic
- Spreadsheet Integration: google-api-python-client
- Concurrency: asyncio
- Dependency Management: uv
- CI/CD: GitHub Actions (with a self-hosted runner)
- Deployment: systemd on Linux
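To show how validated AI output might be enriched with user context before it reaches the sheet, here is a rough sketch. It uses a standard-library dataclass as a stand-in for the project's Pydantic model so that it runs without extra dependencies; the field names, the to_sheet_row helper, and the column order are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AiProductInfo:
    """Stand-in for the Pydantic model; these field names are hypothetical."""
    name: str
    price: Optional[float] = None
    currency: Optional[str] = None

    def __post_init__(self) -> None:
        # Minimal checks in the spirit of a Pydantic validator.
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.price is not None and self.price < 0:
            raise ValueError("price must be non-negative")


def to_sheet_row(info: AiProductInfo, url: str, posted_by: str,
                 posted_at: datetime) -> list[str]:
    """Flatten AI output plus Discord context into one row of cell strings."""
    return [
        posted_at.astimezone(timezone.utc).isoformat(),
        posted_by,
        url,
        info.name,
        "" if info.price is None else f"{info.price:.2f}",
        info.currency or "",
    ]
```

Rejecting bad values before the row is built keeps malformed AI responses out of the spreadsheet.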
Follow these instructions to set up and run your own instance of DiscProdSheetify.
- Python 3.11+
- uv - A fast Python package installer and resolver.
- Google Chrome installed on the host machine (for Selenium).
- A Discord Bot Token, a Google AI Studio API Key, and Google Sheets API credentials.
git clone https://github.com/Makerspace-Ashoka/DiscProdSheetify.git
cd DiscProdSheetify

The project uses a .env file to manage secrets and configuration.
- Create the .env file: Copy the example template to create your own local configuration file. This file is listed in .gitignore and should never be committed.
  cp config/.env.example config/.env
- Edit config/.env: Open the file and fill in the values for your environment.
  # config/.env
  AI_STUDIO_API_KEY="your_google_ai_studio_key"
  DISCORD_BOT_TOKEN="your_discord_bot_token"
  GOOGLE_SHEETS_CREDENTIALS_JSON_PATH="/path/to/your/project/config/google-sheets-api-key.json"
  GOOGLE_SHEET_ID="your_google_sheet_id"
  GOOGLE_SHEET_NAME="SheetName"
- Add Google Sheets Credentials: Place your google-sheets-api-key.json file (obtained from the Google Cloud Console) inside the config/ directory.
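A startup check along the following lines can catch a missing value before the bot connects to Discord. This is a sketch under assumptions: the Settings class and load_settings function are hypothetical names, and it reads variables that are already present in the process environment (for example via a systemd EnvironmentFile) rather than parsing config/.env itself.

```python
import os
from dataclasses import dataclass

# Keys taken from the config/.env template.
REQUIRED_KEYS = (
    "AI_STUDIO_API_KEY",
    "DISCORD_BOT_TOKEN",
    "GOOGLE_SHEETS_CREDENTIALS_JSON_PATH",
    "GOOGLE_SHEET_ID",
    "GOOGLE_SHEET_NAME",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; one field per required key."""
    ai_studio_api_key: str
    discord_bot_token: str
    google_sheets_credentials_json_path: str
    google_sheet_id: str
    google_sheet_name: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ), failing fast."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return Settings(**{key.lower(): env[key] for key in REQUIRED_KEYS})
```

Passing a plain dictionary as env makes the check easy to exercise in tests without touching real secrets.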
Set up the virtual environment and install all required packages using uv.
uv venv # Creates a .venv folder
uv sync # Installs packages from pyproject.toml/uv.lock

To run the bot directly from your terminal for testing and development:
uv run python main.py

For a reliable, long-running deployment, the bot is designed to run as a systemd service on a Linux server.
- Create the Service File:
  sudo nano /etc/systemd/system/DiscProdSheetify.service
- Paste and Edit the Configuration: Use the template below, making sure to replace your_vm_user and verify all paths are correct.
  [Unit]
  Description=Discord Product Scraper Bot (DiscProdSheetify)
  After=network-online.target

  [Service]
  User=your_vm_user
  Group=your_vm_user
  WorkingDirectory=/home/your_vm_user/DiscProdSheetify
  EnvironmentFile=/home/your_vm_user/DiscProdSheetify/config/.env
  # Use xvfb-run to provide a virtual display for headless Selenium
  ExecStart=xvfb-run -a /home/your_vm_user/.local/bin/uv run python main.py
  Restart=on-failure
  RestartSec=5

  [Install]
  WantedBy=multi-user.target
- Enable and Start the Service:
  sudo systemctl daemon-reload
  sudo systemctl enable DiscProdSheetify.service
  sudo systemctl start DiscProdSheetify.service
- Check Status and Logs:
  sudo systemctl status DiscProdSheetify.service
  sudo journalctl -u DiscProdSheetify.service -f
This project is configured for automated deployments using GitHub Actions. Because the bot runs on a private network, deployments go through a self-hosted runner.
The workflow is defined in .github/workflows/deploy.yml and performs the following steps on every push to main:
- Pulls the latest code and hard-resets the working tree to match it (for example, git fetch followed by git reset --hard origin/main).
- Syncs Python dependencies with uv sync.
- Restarts the systemd service gracefully.
To enable this, you must set up a self-hosted runner on your server and add the required repository secrets (SSH_HOST, SSH_USER, SSH_PRIVATE_KEY).
product_scraper_bot/
├── .github/ # GitHub Actions CI/CD workflows
├── config/
│ ├── .env # Stores secrets and configuration
│ └── .env.example # Template for the .env file
├── src/
│ ├── data_models.py # Pydantic models for data validation
│ ├── discord_reader.py # Discord bot client and message handling
│ ├── fetchers.py # Selenium logic for fetching web content
│ ├── interfaces.py # Abstract base classes for components
│ ├── parsers.py # AI logic for parsing screenshots
│ ├── worker.py # Core processing pipeline orchestrator
│ └── writers.py # Google Sheets API integration
├── main.py # Main application entry point
└── README.md # This file
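The role of interfaces.py can be illustrated with a small sketch of abstract base classes and an in-memory test double. The class names, method signatures, and the process_url helper below are assumptions for illustration only; the real interfaces in src/ may look different.

```python
from abc import ABC, abstractmethod


class Fetcher(ABC):
    @abstractmethod
    def fetch(self, url: str) -> str:
        """Return the path of a screenshot taken of the page at url."""


class Parser(ABC):
    @abstractmethod
    def parse(self, screenshot_path: str) -> dict:
        """Return structured product data extracted from the screenshot."""


class Writer(ABC):
    @abstractmethod
    def write(self, row: list[str]) -> None:
        """Persist one row of output."""


class InMemoryWriter(Writer):
    """Test double that collects rows instead of calling the Sheets API."""

    def __init__(self) -> None:
        self.rows: list[list[str]] = []

    def write(self, row: list[str]) -> None:
        self.rows.append(row)


def process_url(url: str, fetcher: Fetcher, parser: Parser, writer: Writer) -> None:
    # The orchestrator depends only on the abstract interfaces, so any
    # concrete fetcher, parser, or writer can be swapped in.
    screenshot = fetcher.fetch(url)
    info = parser.parse(screenshot)
    writer.write([url, str(info.get("name", ""))])
```

Coding the worker against interfaces like these lets the pipeline be tested without a browser, an AI API key, or a spreadsheet.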