- Introduction
- Objective
- Implementation
- Choice of Dataset
- Data Exploration
- Machine Learning Models
- Evaluation Methodology
- Explore the Notebook
Welcome to the Text Classification project on sentiment analysis of movie reviews. The film industry is adapting to shifting audiences and economic uncertainty, and this Jupyter Notebook applies sentiment analysis to movie reviews to provide insights that can inform that adaptation.
The primary objective is to perform sentiment analysis with text classification models and to compare how well different algorithms perform. Each movie review is classified as either positive or negative. While not groundbreaking, the results can help filmmakers, production companies, and other stakeholders make informed decisions about content creation, marketing, and audience engagement.
The project unfolds through data cleaning, simple textual analysis, and the construction of multiple machine learning classification models. These models are trained on a carefully chosen dataset, the Large Movie Review Dataset, offering a balanced set of 50,000 reviews. Performance evaluation metrics such as accuracy, precision, recall, and F1-score will be employed to identify the best-performing model.
The Large Movie Review Dataset, sourced from the Stanford Artificial Intelligence Laboratory, was chosen for its size, diversity, and credibility. The dataset consists of 25,000 training and 25,000 test reviews and provides a balanced representation of positive and negative sentiments.
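As a rough illustration of how the dataset could be read from disk, here is a minimal sketch. It assumes the archive has been extracted so that each split has `pos` and `neg` folders of text files named `<id>_<rating>.txt`; the function names `parse_review_filename` and `load_split` are hypothetical and not taken from the notebook.

```python
from pathlib import Path


def parse_review_filename(name):
    """Split a name such as '200_8.txt' into (review_id, rating)."""
    review_id, rating = Path(name).stem.split("_")
    return int(review_id), int(rating)


def load_split(root, split):
    """Return a list of (text, label, rating) tuples for 'train' or 'test'.

    Assumed layout: <root>/<split>/pos/*.txt and <root>/<split>/neg/*.txt.
    label is 1 for reviews in the pos folder and 0 for the neg folder.
    """
    rows = []
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = Path(root) / split / label_name
        for path in sorted(folder.glob("*.txt")):
            _, rating = parse_review_filename(path.name)
            rows.append((path.read_text(encoding="utf-8"), label, rating))
    return rows
```

Keeping the rating next to each review makes it straightforward to tally a rating distribution later, for example with `collections.Counter` over the third element of each tuple.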
Description: This image depicts the distribution of ratings for the movie reviews in the dataset.
Description: The distribution of words before preprocessing (stopword removal, lemmatization, changing words to their full form, and removing identified features).
Description: This image compares the distribution of frequently used words after stopword removal and the other preprocessing steps.
Description: WordNet visualization for negative and positive sentiments.
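The preprocessing behind the word-distribution figures can be sketched roughly as follows. This is a simplified illustration, not the notebook's actual code: it reads "changing words to full form" as expanding contractions (an assumption), uses tiny hand-written `STOPWORDS` and `CONTRACTIONS` tables (hypothetical), and leaves out lemmatization, which normally needs a library such as NLTK or spaCy.

```python
import re

# Tiny illustrative tables (assumptions); a real run would use a full
# stopword list and a proper lemmatizer from an NLP library.
STOPWORDS = {"a", "an", "the", "is", "it", "was", "this", "and", "of", "to", "i"}
CONTRACTIONS = {
    "don't": "do not",
    "isn't": "is not",
    "it's": "it is",
    "wasn't": "was not",
}


def preprocess(review):
    """Lowercase, strip HTML tags, expand contractions, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", review.lower())  # e.g. <br /> line breaks
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)  # keep alphabetic tokens only
    return [t for t in tokens if t not in STOPWORDS]
```

The stopword set here deliberately keeps "not", since negation usually carries sentiment; whether the notebook makes the same choice is not stated.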
The following machine learning models were employed for sentiment analysis:
- Recurrent Neural Network (RNN) with L2 Regularization
- Support Vector Machine (SVM)
- Bag of Words
- BERT by Google (based on textattack/bert-base-uncased-imdb)
- Term Frequency-Inverse Document Frequency (TF-IDF)
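Bag of Words and TF-IDF are usually used as text representations that feed a classifier such as an SVM. To make the two concrete, here is a minimal pure-Python sketch. The function names are hypothetical, and the smoothed idf formula shown is one common variant that the notebook may or may not use.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return one Counter of raw term counts per tokenized document."""
    return [Counter(doc) for doc in docs]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes idf(t) = ln((1 + n) / (1 + df(t))) + 1, a smoothed variant;
    other definitions of idf are also in common use.
    """
    n = len(docs)
    counts = bag_of_words(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]
```

With the two documents `["good", "movie"]` and `["bad", "movie"]`, the shared term "movie" receives a lower weight than "good" or "bad", which is the effect TF-IDF is meant to have on words that appear everywhere.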
The project's success will be gauged using metrics such as accuracy, precision, recall, and F1-score. The confusion matrix will offer a detailed breakdown of the model's predictions against actual sentiments. While accuracy remains the primary metric, other considerations will be explored during model optimization.
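For reference, the four metrics follow directly from the confusion matrix counts. The sketch below assumes labels are encoded as 1 for positive and 0 for negative; the function names are hypothetical, and a library such as scikit-learn would typically be used in practice.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating 1 as positive and 0 as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On a balanced dataset like this one, accuracy is a reasonable headline number, while precision and recall show whether the errors lean toward false positives or false negatives.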
The model developed in this project is compared with Google's BERT model fine-tuned on a similar dataset.

Explore the notebook to uncover insights into user sentiments in the film industry, with the potential for transfer learning applications in diverse fields beyond movie reviews. Your journey into understanding and leveraging sentiment analysis begins here!



