This project builds on the GoEmotions dataset, a human-annotated corpus of 58k Reddit comments labeled for 27 fine-grained emotion categories plus Neutral. Our goal is to analyze, evaluate, and build improved emotion classification systems using both fine-grained and coarse-grained (Ekman) taxonomies.
The project involves:
- Dataset analysis and visualization
- Emotion-word association extraction
- Label remapping to higher-level categories (e.g., Ekman)
- Model training and evaluation using fine-tuned BERT, RoBERTa, and DeBERTa models
- Experiments with contextual metadata: subreddit and author identity
Dataset overview:

- Source: Reddit comments, curated by Google Research
- Examples: 58,009
- Labels: 27 fine-grained emotion categories + Neutral
- Sequence length: Max 30 tokens
After filtering by rater agreement (at least 2 raters agreeing on a label), we use the following splits:
- Train: 43,410
- Validation: 5,426
- Test: 5,427
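As an optional cross-check (not required by this repo), the same filtered splits are published on the Hugging Face hub, where the `"simplified"` configuration corresponds to the rater-agreement-filtered version:

```python
from datasets import load_dataset

# "simplified" is the rater-agreement-filtered version of GoEmotions.
ds = load_dataset("go_emotions", "simplified")
print({split: len(ds[split]) for split in ds})  # expect 43410 / 5426 / 5427
```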
The 27 fine-grained emotion categories are: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise
- Raw `.csv` files in `data/full_dataset/` include metadata and annotations.
- Filtered and remapped `.tsv` files contain:
  - Comment text
  - Comma-separated emotion IDs
  - Comment ID
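For example, one of the filtered `.tsv` splits can be loaded with pandas, assuming the three tab-separated columns listed above (the path is illustrative):

```python
import pandas as pd

# The filtered .tsv files have no header row and contain three columns:
# comment text, comma-separated emotion IDs, and comment ID.
train = pd.read_csv(
    "data/train.tsv",  # illustrative path; adjust to your layout
    sep="\t",
    header=None,
    names=["text", "emotion_ids", "comment_id"],
)

# Expand the comma-separated label IDs into lists of ints for downstream use.
train["emotion_ids"] = train["emotion_ids"].apply(
    lambda ids: [int(i) for i in str(ids).split(",")]
)
print(train.head())
```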
Analysis notebooks:

- `analyze_data.ipynb`: Computes label distributions, correlations, and hierarchical clustering.
- `extract_words.ipynb`: Computes top words for each emotion using log-odds ratio analysis.
- `replace_emotions.ipynb`: Maps fine-grained labels into coarse-grained categories (e.g., Ekman).
- `calculate_metrics.ipynb`: Evaluates classifier predictions against ground truth using accuracy, precision, recall, and F1 (macro, micro, weighted).
- `EDA.ipynb`: Contains visualizations, label imbalance checks, and exploratory statistics.
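As a reference for what `calculate_metrics.ipynb` computes, here is a minimal scikit-learn sketch; the label arrays are stand-ins, not real predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy single-label predictions over four classes (stand-ins for model output).
y_true = np.array([0, 2, 1, 3, 2, 0])
y_pred = np.array([0, 2, 2, 3, 1, 0])

print("accuracy:", accuracy_score(y_true, y_pred))
for avg in ("macro", "micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```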
We include support for:
- Heatmaps of label correlations
- Dendrograms of emotion label clustering
- Sentiment-colored clustermaps
- Top word bar plots by emotion
- Confusion matrices, classification reports, and macro-F1 plots
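A minimal sketch of the label-correlation heatmap and clustermap, assuming a multi-hot label matrix (one 0/1 column per emotion) has already been built; the random data below is a placeholder:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder multi-hot label matrix; in practice, build one column per emotion.
rng = np.random.default_rng(0)
labels = pd.DataFrame(
    rng.integers(0, 2, size=(500, 5)),
    columns=["anger", "joy", "sadness", "fear", "surprise"],
)

corr = labels.corr()  # pairwise correlations between emotion labels
sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.title("Emotion label correlations")
plt.tight_layout()
plt.show()

# A clustermap also reorders rows/columns by hierarchical clustering (dendrograms).
sns.clustermap(corr, cmap="coolwarm", center=0)
plt.show()
```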
This project explores multiple transformer-based emotion classification models trained on the GoEmotions dataset, using both text-only and context-augmented variants.
- `bert_classifier.py`: Fine-tunes BERT-base (cased) for multi-label classification on GoEmotions. Supports optional label remapping and hierarchical loss.
- `roberta_classifier.py`: Fine-tunes RoBERTa-base for multi-class classification using cross-entropy loss. Supports token prepending for contextual metadata (e.g., `[SUBREDDIT:]`, `[AUTHOR:]`) and dynamic embedding resizing. Integrated with Hugging Face's `Trainer` API, with macro F1 as the main evaluation metric.
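For orientation, a minimal multi-label fine-tuning sketch in the spirit of `bert_classifier.py`, using Hugging Face's built-in `problem_type` (the example text and label indices are illustrative, not the repo's exact code):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_LABELS = 28  # 27 fine-grained emotions + Neutral

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # BCEWithLogitsLoss internally
)

# A comment can carry several emotions, so labels are a multi-hot float vector.
enc = tokenizer(
    "Thank you so much, this made my day!",
    truncation=True, max_length=30, return_tensors="pt",
)
labels = torch.zeros((1, NUM_LABELS))
labels[0, [15, 17]] = 1.0  # e.g., gratitude and joy (indices illustrative)

out = model(**enc, labels=labels)
print(out.loss)
```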
Notebooks:

- `01_baseline_text_model.ipynb`: Trains the baseline RoBERTa model on comment text only (no context).
- `text_model_with_no_context_RoBERTa_cleaned_data.ipynb`: RoBERTa trained on cleaned data without any context.
- `text_model_with_subreddit_context_RoBERTa_cleaned_data.ipynb`: RoBERTa trained with the subreddit prepended.
- `text_model_with_author_context_RoBERTa_cleaned_data.ipynb`: RoBERTa trained with author identity prepended.
- `text_model_with_subreddit_and_author_context_RoBERTa_cleaned_data.ipynb`: RoBERTa trained with combined subreddit and author context.
- `bert_multi_label_text_classification.ipynb`: Fine-tunes multi-label BERT using the raw or remapped emotion labels.
- `deberta_model_raw_data.ipynb`: DeBERTa trained on the raw GoEmotions dataset (unfiltered).
- `deberta_clean_data.ipynb`: DeBERTa trained on the cleaned dataset without context.
- `cos_deberta_context.ipynb`: DeBERTa trained on cleaned data with author + subreddit context.
- `text_model_with_subreddit_and_author_context_deBERTa_cleaned_data.ipynb`: Final context-aware DeBERTa model, also used to generate plots.
- `plots_for_paper.ipynb`: Generates all final evaluation plots, confusion matrices, and JSON output for paper reporting.
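The context-augmented notebooks above prepend metadata markers to the comment text. A hedged sketch of that mechanism, assuming the markers are registered as special tokens (the exact marker format in the notebooks may differ):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=28
)

# Register the context markers so the tokenizer keeps them as single tokens.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[SUBREDDIT:]", "[AUTHOR:]"]}
)
model.resize_token_embeddings(len(tokenizer))  # dynamic embedding resizing

def with_context(text: str, subreddit: str, author: str) -> str:
    # Prepend subreddit and author identity before the comment itself.
    return f"[SUBREDDIT:] {subreddit} [AUTHOR:] {author} {text}"

enc = tokenizer(
    with_context("This is great!", "r/aww", "user123"),
    truncation=True, return_tensors="pt",
)
```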
We provide a mapping file, `ekman_mapping.json`, that aggregates the fine-grained GoEmotions labels into the six Ekman universal emotions plus Neutral. This enables:
- Coarser-grained classification
- Emotion-level analysis at different abstraction levels
Example (abridged):

```json
{
  "anger": ["anger", "annoyance", "disapproval"],
  "joy": ["joy", "gratitude", "love"],
  "sadness": ["grief", "remorse", "sadness", "disappointment"],
  "neutral": ["neutral"]
}
```
To get started:

- Mount your Google Drive in Colab:

```python
from google.colab import drive
drive.mount('/content/drive')
```

- Then follow the individual notebooks/scripts:
  - For data analysis: see `analyze_data.ipynb`
  - For top words: see `extract_words.ipynb`
  - For label remapping: see `replace_emotions.ipynb`
  - For evaluation: see `calculate_metrics.ipynb`
  - For EDA and visualizations: see `EDA.ipynb` and `plots_for_paper.ipynb`
- Python 3.8+
- pandas, numpy, matplotlib, seaborn, scikit-learn
- PyTorch or TensorFlow for modeling (depending on classifier)
- Hugging Face Transformers
We replicate and extend the BERT-based baseline reported in the GoEmotions paper. Metrics include:
- Emotion-level F1 (macro, micro)
- Ekman-level and sentiment-level performance
- Reddit-based data introduces demographic and cultural bias
- Labels were assigned without conversational context and are often ambiguous without the surrounding thread
- Annotators were native English speakers from India, which may affect emotion perception
We highlight the importance of cautious deployment and fairness-aware modeling.
If you use this code or dataset, please cite:
```bibtex
@inproceedings{demszky2020goemotions,
  author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
  booktitle = {ACL},
  year = {2020}
}
```

Contributors:

- Arun Agarwal (UC Berkeley MIDS)
- Original GoEmotions team @ Google Research
Apache 2.0. See LICENSE file for details.
For more on model cards, ethical use, and detailed results, see: