This project is a Natural Language Processing (NLP) pipeline for sentiment analysis on the IMDB movie reviews dataset. The goal is to classify each review as positive or negative and to compare the accuracy of several machine learning models on the same features.
- Python 3.x
- Jupyter Notebook / Google Colab
- Libraries:
  - `pandas`: Data manipulation
  - `numpy`: Numerical operations
  - `scikit-learn`: Machine learning models & metrics
  - `nltk`: Text preprocessing (stopwords, lemmatization)
  - `matplotlib`: Accuracy comparison chart
- Dataset: IMDB Dataset of 50K Movie Reviews
- Dataset Loading
- Load the IMDB dataset (CSV file)
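A minimal sketch of the loading step. The column names `review` and `sentiment`, the `label` column, and the file path in the comment are assumptions, not taken from the project; a small in-memory CSV stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the real CSV file. The column names `review` and `sentiment`
# are assumed here, not confirmed by the project.
SAMPLE_CSV = io.StringIO(
    "review,sentiment\n"
    '"A wonderful little film.<br /><br />Loved it!",positive\n'
    '"Dull plot and wooden acting.",negative\n'
)


def load_reviews(source):
    """Read the reviews CSV and add a numeric label: 1 = positive, 0 = negative."""
    df = pd.read_csv(source)
    df["label"] = df["sentiment"].map({"positive": 1, "negative": 0})
    return df


# For the real data, pass the path to the downloaded file instead,
# e.g. load_reviews("IMDB Dataset.csv") (hypothetical path).
df = load_reviews(SAMPLE_CSV)
print(df.shape)
```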
- Text Preprocessing
- Lowercase
- Remove HTML tags
- Remove non-alphabetic characters
- Stopwords removal
- Lemmatization
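The preprocessing steps above could be chained roughly as follows. This is a sketch, not the project's code: `clean_text`, `DEMO_STOPWORDS`, and the regular expressions are hypothetical names, and the tiny stopword set and identity lemmatizer are stand-ins so the snippet runs without downloading NLTK data.

```python
import re

# Tiny illustrative stopword set (assumption). With NLTK installed and its data
# downloaded, use set(nltk.corpus.stopwords.words("english")) instead.
DEMO_STOPWORDS = {"a", "an", "the", "is", "was", "it", "and", "of", "this"}

TAG_RE = re.compile(r"<[^>]+>")        # matches HTML tags such as <br />
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # matches anything except a-z and whitespace


def clean_text(text, stop_words=DEMO_STOPWORDS, lemmatize=lambda word: word):
    """Lowercase, strip HTML tags and non-letters, drop stopwords, lemmatize.

    `lemmatize` defaults to the identity function; to use NLTK, pass
    nltk.stem.WordNetLemmatizer().lemmatize instead.
    """
    text = text.lower()
    text = TAG_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    tokens = [lemmatize(tok) for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)


cleaned = clean_text("This movie was GREAT!<br />10/10, loved the actors.")
print(cleaned)  # prints "movie great loved actors"
```

With a real lemmatizer plugged in, plural forms such as "actors" would typically be reduced to their base form.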
- Feature Extraction
- TF-IDF vectorization (`max_features=5000`)
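As a rough illustration of the feature extraction step, the sketch below applies scikit-learn's `TfidfVectorizer` with `max_features=5000` to a toy corpus. The three sentences are invented for the example; all other vectorizer settings are left at their defaults, which is an assumption about the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, already-cleaned reviews (invented for illustration).
corpus = [
    "movie great loved actor",
    "movie dull boring plot",
    "great plot great actor",
]

# Keep at most the 5000 most frequent terms; the toy corpus has far fewer.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per review

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row of `X` is a sparse TF-IDF vector with one column per vocabulary term, which is the input format the classifiers expect.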
- Train/Test Split
- 80% training, 20% testing
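An 80/20 split can be sketched with scikit-learn's `train_test_split`. The placeholder data and the `random_state` value are assumptions added for reproducibility; the project does not state a seed.

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels (invented for illustration).
texts = [f"review {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# test_size=0.2 gives the 80% / 20% split; random_state is an assumed seed.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # prints "8 2"
```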
- Model Training & Evaluation
- Logistic Regression
- Naive Bayes (MultinomialNB)
- Support Vector Machine (SVM)
- Random Forest
- Metrics:
- Accuracy
- Classification Report (Precision, Recall, F1-score)
- Confusion Matrix
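The training and evaluation loop might look like the sketch below. Several things here are assumptions: the toy reviews, the hyperparameters, and the choice of `LinearSVC` for the SVM (the project does not say which SVM variant it uses). The scores on this toy data say nothing about performance on the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy, already-cleaned and already-split data (invented for illustration).
train_texts = [
    "great movie loved acting", "wonderful story great cast",
    "loved film brilliant plot", "excellent acting wonderful film",
    "boring movie terrible acting", "awful story dull cast",
    "terrible film boring plot", "dull acting awful film",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great cast brilliant film", "boring plot awful acting"]
test_labels = [1, 0]

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Hyperparameters below are assumptions, not the project's settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

accuracies = {}
matrices = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predictions = model.predict(X_test)
    accuracies[name] = accuracy_score(test_labels, predictions)
    matrices[name] = confusion_matrix(test_labels, predictions, labels=[0, 1])
    print(f"{name}: accuracy = {accuracies[name]:.2f}")
    print(classification_report(test_labels, predictions, zero_division=0))
    print(matrices[name])
```

Collecting the scores in a dictionary keeps them ready for a side-by-side comparison of the four models.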
- Accuracy Comparison
- Bar chart visualization of all models
| Model | Accuracy |
|---|---|
| Logistic Regression | 0.88 |
| Naive Bayes | 0.85 |
| SVM | 0.89 |
| Random Forest | 0.86 |
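The accuracies in the table above can be drawn as a bar chart along these lines. This is a sketch: the output filename, figure size, axis limits, and the non-interactive `Agg` backend are assumptions chosen so the snippet runs in a script without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Accuracy values copied from the results table.
accuracies = {
    "Logistic Regression": 0.88,
    "Naive Bayes": 0.85,
    "SVM": 0.89,
    "Random Forest": 0.86,
}

fig, ax = plt.subplots(figsize=(7, 4))
bars = ax.bar(list(accuracies.keys()), list(accuracies.values()))

# Write each accuracy just above its bar.
for bar in bars:
    ax.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 0.002,
        f"{bar.get_height():.2f}",
        ha="center",
    )

ax.set_ylim(0.80, 0.92)  # zoomed range (assumed) so small differences are visible
ax.set_ylabel("Accuracy")
ax.set_title("Model accuracy comparison")
fig.tight_layout()
fig.savefig("accuracy_comparison.png")  # hypothetical output filename
```

Because the y-axis is zoomed to a narrow range, the bars exaggerate the gaps between models; the differences in the table are only a few percentage points.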