A comprehensive sentiment analysis tool that scrapes customer reviews, analyzes sentiment using NLP and machine learning, and provides insights through an interactive dashboard with GPT integration.
- Web Scraping: Scrape reviews from Amazon, Flipkart, and other websites
- Text Processing: Clean, preprocess, and analyze text using NLTK
- Sentiment Analysis: Train custom ML models (Logistic Regression, Random Forest, SVM)
- Database Storage: Store data in SQLite or MongoDB
- Interactive Dashboard: Streamlit-based web interface
- GPT Integration: AI-powered insights and natural language queries
- Multilingual Support: Detect and translate non-English reviews
- Aspect-Based Analysis: Analyze specific aspects (delivery, quality, price, etc.)
- Real-time Visualization: Charts, word clouds, and trend analysis
- Natural Language Q&A: Ask questions about reviews in plain English
- Business Recommendations: AI-generated actionable insights
```bash
# Clone the repository
git clone <repository-url>
cd sentiment-analysis-tool

# Install dependencies
pip install -r requirements.txt

# Download NLTK data (will be done automatically on first run)
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('vader_lexicon')"
```

```bash
# Copy environment file
cp .env.example .env

# Edit .env file with your API keys
# Add your OpenAI API key for GPT features (optional)
OPENAI_API_KEY=your_api_key_here
```

```bash
# Run the demo workflow
python main.py

# Launch the interactive dashboard
streamlit run dashboard.py
```

- Key Metrics: Total reviews, average rating, sentiment distribution (see the dashboard sketch after this feature list)
- Visualizations: Pie charts, histograms, trend lines
- Word Clouds: Visual representation of positive/negative themes
- Recent Reviews: Latest review data
- Web Scraping: Automated review collection from e-commerce sites
- File Upload: Import CSV data
- Data Filtering: Filter by sentiment, source, rating
- Multiple Algorithms: Logistic Regression, Random Forest, SVM
- Hyperparameter Tuning: Automated optimization
- Model Evaluation: Accuracy metrics, confusion matrix
- Real-time Testing: Test model with custom text
- Review Summarization: AI-generated summaries
- Aspect Analysis: Focus on specific areas (delivery, quality, etc.)
- Natural Language Q&A: Ask questions in plain English
- Business Recommendations: Actionable insights
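The overview pieces above come down to a handful of Streamlit and Plotly calls. Here is a minimal sketch, assuming reviews have already been scraped and scored into a CSV with `rating` and `sentiment` columns; the file name and column names are illustrative, not the project's actual layout:

```python
import pandas as pd
import plotly.express as px
import streamlit as st

# Illustrative input: a DataFrame of already-analyzed reviews
df = pd.read_csv("reviews_scored.csv")

# Key metrics row
col1, col2, col3 = st.columns(3)
col1.metric("Total reviews", len(df))
col2.metric("Average rating", f"{df['rating'].mean():.2f}")
col3.metric("Positive share", f"{(df['sentiment'] == 'positive').mean():.0%}")

# Sentiment distribution pie chart
fig = px.pie(df, names="sentiment", title="Sentiment distribution")
st.plotly_chart(fig, use_container_width=True)
```

A page like this is what `streamlit run dashboard.py` serves in the quick-start commands above.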
```python
from scraper import ReviewScraper
scraper = ReviewScraper()
# Scrape Amazon reviews
amazon_url = "https://www.amazon.com/dp/PRODUCT_ID"
reviews_df = scraper.scrape_amazon_reviews(amazon_url, max_pages=5)
# Scrape Flipkart reviews
flipkart_url = "https://www.flipkart.com/product-name/p/PRODUCT_ID"
reviews_df = scraper.scrape_flipkart_reviews(flipkart_url, max_pages=3)
```
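Under the hood, this kind of scraping is a loop of fetching pages with `requests`, parsing them with BeautifulSoup, and applying CSS selectors, with a delay between requests for rate limiting. The sketch below shows the general pattern only; the function name, selector argument, and pagination parameter are illustrative assumptions, not the project's actual code:

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_reviews_generic(url, review_selector, max_pages=3, delay=2.0):
    """Collect review texts from a paginated listing using a CSS selector."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; review-scraper)"}
    rows = []
    for page in range(1, max_pages + 1):
        response = requests.get(url, params={"page": page}, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        for node in soup.select(review_selector):
            rows.append({"review_text": node.get_text(strip=True), "source": url})
        time.sleep(delay)  # be polite: pause between requests
    return pd.DataFrame(rows)
```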
```python
from text_processor import TextProcessor
processor = TextProcessor()
# Process single text
text = "This product is amazing! Great quality and fast delivery."
processed = processor.preprocess_text(text)
sentiment_scores = processor.get_sentiment_scores(processed)
# Process entire dataframe
df = processor.process_dataframe(reviews_df)
```
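The text processing stage follows the usual NLTK recipe (tokenize, drop stopwords and punctuation, lemmatize) and uses VADER for the sentiment scores, which is why `vader_lexicon` is downloaded during installation. This is a sketch of that general approach, not the exact `TextProcessor` internals:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
vader = SentimentIntensityAnalyzer()

def preprocess(text):
    """Lowercase, tokenize, remove stopwords and punctuation, lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    kept = [lemmatizer.lemmatize(t) for t in tokens
            if t not in stop_words and t not in string.punctuation]
    return " ".join(kept)

def score(text):
    """VADER returns pos/neu/neg scores plus a compound score in [-1, 1]."""
    scores = vader.polarity_scores(text)
    if scores["compound"] >= 0.05:
        scores["sentiment"] = "positive"
    elif scores["compound"] <= -0.05:
        scores["sentiment"] = "negative"
    else:
        scores["sentiment"] = "neutral"
    return scores
```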
```python
from sentiment_model import SentimentModel
model = SentimentModel()
# Prepare data
X, y = model.prepare_data(df)
# Train model
results = model.train_model(X, y, model_type='logistic')
# Make predictions
prediction = model.predict_single("Great product, highly recommend!")
```
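Training the custom models is the standard scikit-learn text-classification recipe: TF-IDF features feeding a classifier, with hyperparameters tuned by cross-validated grid search. The sketch below continues from the processed `df` above and assumes `processed_text` and `sentiment` column names plus an illustrative parameter grid:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df["processed_text"], df["sentiment"],
    test_size=0.2, stratify=df["sentiment"], random_state=42,
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Automated hyperparameter tuning via cross-validated grid search
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, scoring="f1_macro")
grid.fit(X_train, y_train)

predictions = grid.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```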
```python
from gpt_integration import GPTAnalyzer
analyzer = GPTAnalyzer()
# Generate summary
summary = analyzer.summarize_reviews(reviews_list)
# Answer questions
answer = analyzer.answer_question(df, "What do people say about delivery?")
# Extract insights
insights = analyzer.extract_insights(df, aspect="quality")
```
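Behind question answering, the usual pattern is to pack a sample of review texts into a prompt and send it to the OpenAI chat completions API. Here is a minimal sketch using the official `openai` client; the model name, sample size, and prompt wording are assumptions, and `OPENAI_API_KEY` must be set in the environment:

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def answer_question(df, question, sample_size=50, model="gpt-4o-mini"):
    """Answer a plain-English question using a sample of review texts as context."""
    reviews = df["review_text"].dropna()
    sample = reviews.sample(min(sample_size, len(reviews)), random_state=0)
    context = "\n".join(f"- {text}" for text in sample)
    prompt = (
        "You are analyzing customer reviews. Answer the question using only "
        f"the reviews below.\n\nReviews:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```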
```sql
CREATE TABLE reviews (
    id INTEGER PRIMARY KEY,
    rating REAL,
    review_text TEXT,
    processed_text TEXT,
    reviewer_name VARCHAR(255),
    date VARCHAR(100),
    source VARCHAR(100),
    product_url TEXT,
    language VARCHAR(10),
    sentiment VARCHAR(20),
    compound_score REAL,
    pos_score REAL,
    neu_score REAL,
    neg_score REAL,
    word_count INTEGER,
    char_count INTEGER,
    created_at DATETIME
);
```

```json
{
  "rating": 5.0,
  "review_text": "Great product!",
  "processed_text": "great product",
  "reviewer_name": "John Doe",
  "date": "2024-01-15",
  "source": "Amazon",
  "product_url": "https://example.com/product",
  "language": "en",
  "sentiment": "positive",
  "compound_score": 0.8,
  "pos_score": 0.9,
  "neu_score": 0.1,
  "neg_score": 0.0,
  "word_count": 2,
  "char_count": 14,
  "created_at": "2024-01-15T10:30:00Z"
}
```
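Writing an analyzed review into SQLite needs nothing beyond the standard-library sqlite3 module. A minimal sketch, assuming the reviews table from the schema above already exists and using the record from the document example; the database file name matches the SQLITE_DB_PATH default shown below:

```python
import sqlite3

record = {
    "rating": 5.0, "review_text": "Great product!", "processed_text": "great product",
    "reviewer_name": "John Doe", "date": "2024-01-15", "source": "Amazon",
    "product_url": "https://example.com/product", "language": "en",
    "sentiment": "positive", "compound_score": 0.8, "pos_score": 0.9,
    "neu_score": 0.1, "neg_score": 0.0, "word_count": 2, "char_count": 14,
    "created_at": "2024-01-15T10:30:00Z",
}

# Insert one row using named placeholders; the table must already exist.
with sqlite3.connect("sentiment_analysis.db") as conn:
    columns = ", ".join(record)
    placeholders = ", ".join(f":{key}" for key in record)
    conn.execute(f"INSERT INTO reviews ({columns}) VALUES ({placeholders})", record)
```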
- Database: SQLite path, MongoDB URI
- OpenAI: API key for GPT features
- Scraping: User agent, request delays
- Models: File paths for saved models
```
OPENAI_API_KEY=your_openai_api_key
MONGODB_URI=mongodb://localhost:27017/
SQLITE_DB_PATH=sentiment_analysis.db
```
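These values are typically read once at startup. A minimal sketch assuming python-dotenv (any .env loader works; the library choice is an assumption, with defaults matching the values above):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # pulls key=value pairs from .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MONGODB_URI = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
SQLITE_DB_PATH = os.getenv("SQLITE_DB_PATH", "sentiment_analysis.db")
```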
The tool supports multiple ML algorithms:

- Logistic Regression: Fast, interpretable, good baseline
- Random Forest: Robust, handles non-linear patterns
- SVM: Effective for text classification
Typical performance metrics:
- Accuracy: 85-92% on balanced datasets
- Precision/Recall: Varies by sentiment class
- F1-Score: 0.85-0.90 average
- Amazon: Product reviews with ratings
- Flipkart: Product reviews and ratings
- Generic: Any website with CSS selectors
- Primary: English (full support)
- Multilingual: Auto-detection and translation (see the sketch after this list)
- Supported: Any language supported by TextBlob
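TextBlob is the library named here; its translation helpers have historically relied on an external Google endpoint and may not be available in newer releases. As an illustrative alternative only (not part of this project's stated stack), detection and translation can be sketched with langdetect and deep-translator:

```python
from deep_translator import GoogleTranslator
from langdetect import detect

def to_english(text):
    """Detect the review language and translate to English when needed."""
    language = detect(text)  # e.g. "en", "fr", "hi"
    if language != "en":
        text = GoogleTranslator(source="auto", target="en").translate(text)
    return language, text

print(to_english("Produit excellent, livraison rapide."))
```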
- Respect robots.txt: Check site policies
- Rate Limiting: Built-in delays between requests
- Legal Compliance: Ensure compliance with terms of service
- OpenAI: Requires API key and credits
- Rate Limits: Automatic handling of API limits
- Cost Management: Monitor usage for cost control
- Local Storage: Data stored locally by default
- No External Sharing: Reviews not shared externally
- Anonymization: Personal data can be anonymized
- Scraping Failures
  - Check internet connection
  - Verify URL format
  - Update CSS selectors if needed
- Model Training Errors
  - Ensure sufficient data (minimum 10 samples)
  - Check for missing values
  - Verify text preprocessing
- GPT Integration Issues
  - Verify OpenAI API key
  - Check API quota and billing
  - Handle rate limiting
- Database Errors
  - Check file permissions
  - Verify MongoDB connection
  - Handle concurrent access
- Large Datasets: Use batch processing (see the sketch after this list)
- Memory Usage: Process data in chunks
- Speed: Use appropriate model complexity
- Storage: Regular database maintenance
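For large review exports, the batch and chunking advice above maps directly onto pandas' chunked CSV reading. A sketch with an illustrative file name and a placeholder cleaning step:

```python
import pandas as pd

processed_chunks = []
# Read and process the file in fixed-size chunks instead of loading it all at once
for chunk in pd.read_csv("reviews_large.csv", chunksize=10_000):
    # Placeholder for the real preprocessing step
    chunk["processed_text"] = chunk["review_text"].astype(str).str.lower()
    processed_chunks.append(chunk)

df = pd.concat(processed_chunks, ignore_index=True)
```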
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- NLTK: Natural Language Toolkit
- scikit-learn: Machine Learning library
- Streamlit: Web app framework
- OpenAI: GPT integration
- BeautifulSoup: Web scraping
- Plotly: Interactive visualizations
For issues and questions:
- Check the troubleshooting section
- Search existing issues
- Create a new issue with details
- Include error messages and system info
Happy Analyzing! 🎯