A powerful web crawling and analysis tool that provides insights about websites using AI-powered content analysis, keyword extraction, and smart caching.
- Website content analysis and summarization
- Core page identification and scraping
- Keyword extraction using YAKE!
- AI-powered content summarization using GPT-4.1-mini
- Outbound link analysis
- Redis-based caching system
- Blog title suggestions
- Meta information extraction
```
Insight_Crawler/
├── main.py
├── scrapermod.py
├── llm.py
├── cache.py
├── requirements.txt
└── README.md
```
- Python 3.x
- Redis server running locally on port 6379
- Chrome/Chromium (for Selenium WebDriver)
- GitHub Token for AI model access
Required environment variables:
GITHUB_TOKEN: GitHub token for accessing the AI model endpoint
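A minimal sketch of reading the token at startup — the variable name comes from the list above; the helper name is illustrative, not part of the project:

```python
import os

def get_github_token():
    """Read the GITHUB_TOKEN environment variable, failing fast with a clear message."""
    token = os.environ.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError(
            "GITHUB_TOKEN is not set; it is required to reach the AI model endpoint."
        )
    return token
```

Failing fast here surfaces a missing token at startup instead of as an opaque LLM error later.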
- Clone the repository
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Ensure Redis is running locally
- Set up required environment variables
Performs comprehensive website analysis.
Request body:
```json
{
  "url": "https://example.com"
}
```
Response includes:
- Meta information (title, description, H1 tags)
- Outbound links
- Keywords
- AI-generated offerings summary
- Marketing channel suggestions
- Blog title suggestions
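A stdlib-only client sketch for calling this endpoint. The port matches the address the server runs on; the `/analyze` route path is an assumption, not confirmed by the project:

```python
import json
import urllib.request

API_BASE = "http://0.0.0.0:8000"  # default address the application runs on

def build_payload(url):
    """Serialize the JSON request body shown above."""
    return json.dumps({"url": url}).encode("utf-8")

def analyze(url, endpoint="/analyze"):  # route path is an assumption
    req = urllib.request.Request(
        API_BASE + endpoint,
        data=build_payload(url),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # meta information, outbound links, keywords, and AI-generated summaries
        return json.loads(resp.read())
```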
Generates blog title suggestions for a website.
Request body:
```json
{
  "url": "https://example.com"
}
```
- FastAPI application setup
- Route handlers for analysis and blog suggestions
- Selenium integration for dynamic content
- Error handling and validation
- Core page identification
- Content scraping and cleaning
- Keyword extraction using YAKE!
- URL processing and validation
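The URL processing step can be sketched with the standard library; the function name and exact normalization rules here are illustrative, not the module's actual code:

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    """Validate a URL and normalize it for scraping; raise ValueError on bad input."""
    if "://" not in url:
        url = "https://" + url  # assume https when no scheme is given
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    # drop the fragment and lowercase the host for stable cache keys
    return urlunparse(
        (parsed.scheme, parsed.netloc.lower(), parsed.path, "", parsed.query, "")
    )
```

Normalizing before caching keeps `Example.com/page#top` and `example.com/page` from producing two cache entries for the same page.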
- OpenAI GPT-4.1-mini integration
- Content summarization
- Marketing channel analysis
- Blog title generation
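Since llm.py drives all three generation tasks, the prompts can share one chat-message builder. The sketch below only assembles a standard chat-completion payload; the instruction wording and function name are assumptions, not the project's actual prompts:

```python
def build_messages(task, page_text):
    """Assemble chat messages for one of the three LLM tasks (illustrative prompts)."""
    instructions = {
        "summary": "Summarize what this website offers.",
        "channels": "Suggest suitable marketing channels for this website.",
        "blog_titles": "Suggest blog post titles for this website.",
    }
    return [
        {"role": "system", "content": "You are a website analysis assistant."},
        {"role": "user", "content": f"{instructions[task]}\n\n{page_text}"},
    ]
```

The resulting list is what a chat-completions client would send as its `messages` argument.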
- Redis integration
- Cache management
- Error handling and fallback mechanisms
The application includes comprehensive error handling for:
- Invalid URLs
- Failed scraping attempts
- LLM service unavailability
- Cache connection issues
- HTML parsing errors
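One way to centralize these cases is a single mapping from internal exceptions to HTTP responses. The exception classes and messages below are hypothetical, not the project's actual ones:

```python
# hypothetical internal exception types
class ScrapeError(Exception): ...
class LLMUnavailableError(Exception): ...

# failure category -> (HTTP status code, client-facing message)
ERROR_RESPONSES = {
    ValueError: (400, "Invalid URL"),
    ScrapeError: (502, "Failed to scrape the target site"),
    LLMUnavailableError: (503, "LLM service is currently unavailable"),
}

def to_http_error(exc):
    """Translate an internal exception into an HTTP status and message."""
    for exc_type, response in ERROR_RESPONSES.items():
        if isinstance(exc, exc_type):
            return response
    return (500, "Internal server error")
```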
- Analysis results are cached for 2 hours (7200 seconds)
- Separate caching for full analysis and blog suggestions
- Fallback mechanisms when cache is unavailable
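The fallback behavior can be sketched as a thin wrapper that degrades to an in-memory dict when Redis is unreachable; the class and method names are assumptions:

```python
import json

class AnalysisCache:
    """Cache with a Redis backend and an in-memory fallback (illustrative sketch)."""

    TTL = 7200  # 2 hours, matching the caching policy above

    def __init__(self, redis_client=None):
        self.redis = redis_client  # None when Redis is unavailable
        self.memory = {}

    def get(self, key):
        if self.redis is not None:
            try:
                raw = self.redis.get(key)
                return json.loads(raw) if raw else None
            except Exception:
                pass  # fall through to the in-memory store
        return self.memory.get(key)

    def set(self, key, value):
        if self.redis is not None:
            try:
                # SETEX stores the value with an expiry, so entries age out on their own
                self.redis.setex(key, self.TTL, json.dumps(value))
                return
            except Exception:
                pass
        self.memory[key] = value
```

The in-memory fallback has no expiry; it only keeps the service responsive while Redis is down.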
Start the server:
```bash
python main.py
```
The application will run on http://0.0.0.0:8000
Note: If you face any problems installing the Redis server, here are the steps:
- Windows
  - On PowerShell, run the following command:
    - wsl --install
  - On WSL, run the following commands:
    - curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
    - sudo apt-get update
    - sudo apt-get install redis
  - Installation is now complete. To start the Redis server, run the following commands on WSL:
    - sudo service redis-server start
    - redis-cli
- macOS
  - Install with Homebrew:
    - brew install redis
    - brew services start redis
  - Test Redis with:
    - redis-cli ping