This project implements and evaluates neural and transformer-based models for sexism intention classification, following the EXIST 2023 Task 2 specification.
The goal is to classify tweets into four categories:
- DIRECT
- REPORTED
- JUDGEMENTAL
- '-' (non-sexist)
The system is based on:
- GloVe word embeddings
- LSTM and Bidirectional LSTM neural networks
- A transformer model (Twitter-RoBERTa for hate speech)
The task is to determine the intention behind sexist messages. The categories are:
- DIRECT: The tweet itself is sexist or promotes sexist behavior.
- REPORTED: The tweet reports a sexist event or statement.
- JUDGEMENTAL: The tweet describes a sexist situation in order to criticize or condemn it.
- '-': Non-sexist content.
Each tweet is annotated by six annotators. The final label is obtained using majority voting.
The dataset is a subset of the EXIST 2023 corpus and contains tweets in English and Spanish. Only English tweets are used.
For Task 2, the labels are aggregated using majority voting. Tweets without a clear majority are removed. The final labels are encoded as:
- '-' -> 0
- 'DIRECT' -> 1
- 'JUDGEMENTAL' -> 2
- 'REPORTED' -> 3
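A minimal sketch of this aggregation and encoding step is shown below (the per-annotator column name labels_task_2 is an assumption; the actual field name in the EXIST data may differ):

```python
from collections import Counter

import pandas as pd

LABEL2ID = {"-": 0, "DIRECT": 1, "JUDGEMENTAL": 2, "REPORTED": 3}

def majority_label(annotations):
    """Return the majority label among the six annotations, or None when there is a tie."""
    counts = Counter(annotations).most_common(2)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority: the tweet is discarded
    return counts[0][0]

def aggregate_task2(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-annotator labels into one encoded label per tweet and keep the relevant fields."""
    df = df.copy()
    df["majority"] = df["labels_task_2"].apply(majority_label)  # column name is an assumption
    df = df.dropna(subset=["majority"])                         # remove tweets without a clear majority
    df["label"] = df["majority"].map(LABEL2ID)
    return df[["id_EXIST", "lang", "tweet", "label"]]
```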
Only the following fields are kept:
- id_EXIST
- lang
- tweet
- label
Tweets are noisy and informal. The following preprocessing steps are applied:
- Emojis are removed
- Hashtags are removed
- Mentions (e.g., @user) are removed
- URLs are removed
- Special characters and symbols are removed
- Curly quotes and special quotation marks are normalized
- Lemmatization is applied to reduce words to their base form
This preprocessing produces cleaner and more consistent textual input for the models.
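A minimal cleaning sketch illustrating these steps, assuming spaCy with the en_core_web_sm model for lemmatization (the exact regular expressions in the notebook may differ):

```python
import re

import spacy

# Assumes the small English pipeline is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def clean_tweet(text: str) -> str:
    text = text.replace("“", '"').replace("”", '"').replace("’", "'")  # normalize curly quotes
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                 # remove URLs
    text = re.sub(r"@\w+", " ", text)                                  # remove mentions
    text = re.sub(r"#\w+", " ", text)                                  # remove hashtags
    text = re.sub(r"[^\w\s'.,!?]", " ", text)                          # drop emojis and other symbols
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(token.lemma_ for token in nlp(text.lower()))       # lemmatize
```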
Word embeddings are built using pre-trained GloVe vectors.
The vocabulary is constructed as:
- All tokens appearing in the training set
- Plus all tokens present in GloVe
Out-of-vocabulary handling follows these rules:
- If a token appears in the training set but not in GloVe, it is added to the vocabulary and assigned a custom embedding.
- If a token appears in validation or test but not in the vocabulary, it is mapped to a special <UNK> token.
The <UNK> token is assigned a static embedding (e.g., random initialization).
This ensures that all training tokens have an embedding while unseen tokens are handled consistently.
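A sketch of how the embedding matrix can be built under these rules (the embedding dimension and the reserved <UNK> index are assumptions):

```python
import numpy as np

EMBEDDING_DIM = 100  # assumption: 100-dimensional GloVe vectors
UNK_INDEX = 0        # assumption: index 0 is reserved for <UNK>

def build_embedding_matrix(vocab, glove, seed=42):
    """vocab: token -> index (including <UNK>); glove: token -> pre-trained vector."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab), EMBEDDING_DIM), dtype=np.float32)
    for token, idx in vocab.items():
        if token in glove:
            matrix[idx] = glove[token]                               # pre-trained GloVe vector
        else:
            matrix[idx] = rng.normal(scale=0.6, size=EMBEDDING_DIM)  # custom vector (training-only token or <UNK>)
    return matrix

def encode(tokens, vocab):
    """Map tokens to indices, falling back to <UNK> for unseen tokens."""
    return [vocab.get(token, UNK_INDEX) for token in tokens]
```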
Two recurrent neural architectures are implemented.

BiLSTM (single layer):
- Embedding layer initialized with GloVe embeddings
- One Bidirectional LSTM layer
- Dense softmax classification layer

Stacked BiLSTM (two layers):
- Embedding layer initialized with GloVe embeddings
- Two stacked Bidirectional LSTM layers
- Dense softmax classification layer
The embedding layer can be either frozen or fine-tuned during training.
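A PyTorch sketch of the single-layer variant (framework and hyperparameters are assumptions; the stacked variant simply uses two LSTM layers):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=128, num_layers=1,
                 num_classes=4, freeze_embeddings=True):
        super().__init__()
        weights = torch.as_tensor(embedding_matrix, dtype=torch.float32)
        # freeze=True keeps the GloVe vectors fixed; freeze=False fine-tunes them
        self.embedding = nn.Embedding.from_pretrained(weights, freeze=freeze_embeddings)
        self.lstm = nn.LSTM(weights.shape[1], hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)                  # (batch, seq_len, emb_dim)
        _, (hidden, _) = self.lstm(embedded)                  # hidden: (2 * num_layers, batch, hidden_size)
        final = torch.cat([hidden[-2], hidden[-1]], dim=-1)   # last forward and backward states
        return self.classifier(final)                         # logits; softmax is applied inside the loss

# The stacked variant is obtained with num_layers=2.
```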
Each model is trained using at least three different random seeds to obtain robust estimates.
The models are trained on the training set and evaluated on the validation set. The following metrics are computed:
- Macro F1-score
- Precision
- Recall
Mean and standard deviation across seeds are reported for each metric. The best model is selected based on the macro F1-score.
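A sketch of the per-seed evaluation and aggregation, assuming scikit-learn for the metrics:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    """Macro-averaged metrics for a single run."""
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }

def summarize(per_seed_results):
    """Mean and standard deviation of each metric across the seeds."""
    summary = {}
    for metric in per_seed_results[0]:
        values = [run[metric] for run in per_seed_results]
        summary[metric] = (float(np.mean(values)), float(np.std(values)))
    return summary
```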
A transformer-based model, Twitter-RoBERTa-base for Hate Speech Detection, is used.
The workflow is:
- Load the tokenizer and the model from HuggingFace
- Tokenize the tweets
- Train using the HuggingFace Trainer
- Evaluate on the test set using the same metrics as the LSTM models
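A sketch of this workflow, assuming the cardiffnlp/twitter-roberta-base-hate checkpoint and DataFrames named train_df and val_df with tweet and label columns (the pre-trained two-class hate-speech head is replaced with a four-class head):

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score, precision_score, recall_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "cardiffnlp/twitter-roberta-base-hate"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=4, ignore_mismatched_sizes=True  # swap the 2-class head for a 4-class head
)

def tokenize(batch):
    return tokenizer(batch["tweet"], truncation=True, padding="max_length", max_length=128)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "macro_f1": f1_score(labels, preds, average="macro"),
        "precision": precision_score(labels, preds, average="macro", zero_division=0),
        "recall": recall_score(labels, preds, average="macro", zero_division=0),
    }

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)  # train_df / val_df are assumed names
val_ds = Dataset.from_pandas(val_df).map(tokenize, batched=True)

args = TrainingArguments(output_dir="twitter-roberta-exist", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, compute_metrics=compute_metrics)
trainer.train()
```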
The transformer is compared against the BiLSTM models in terms of performance and error patterns.
The main sources of error include:
- Out-of-vocabulary and rare words
- Informal language and slang typical of tweets
- Class imbalance between DIRECT, REPORTED, and JUDGEMENTAL
- Confusion between REPORTED and JUDGEMENTAL cases
The transformer generally handles context and rare words better than the LSTM models, while the LSTM models are more sensitive to vocabulary coverage and preprocessing quality.
Possible improvements include:
- More advanced tweet-specific preprocessing
- Data augmentation
- Using multilingual or larger transformer models
- Improving handling of rare and unseen words
The final report summarizes all experiments and results following the NLP course template. It includes:
- Description of preprocessing and models
- Performance tables
- Learning curves
- Error analysis
The report is provided in PDF format together with the notebook used for the experiments.
Teaching Assistants:
- Federico Ruggeri, [email protected]
- Eleonora Mancini, [email protected]
Professor:
- Paolo Torroni, [email protected]
This project is developed for academic and educational purposes.