
RooseBERT: A New Deal For Political Language Modelling

Table of Contents

  • 1️⃣ Description
  • 2️⃣ Datasets
  • 3️⃣ Models
  • 4️⃣ Installation
  • 5️⃣ How to Run
    • 🚀 Download the Corpora and Prepare the Dataset
    • 🚀 Running Continuous Pretraining for Masked Language Modeling
    • 🚀 Downstream Tasks
  • 6️⃣ Extract Results

1️⃣ Description

The goal of this project is to continue the pretraining of BERT on a curated dataset of political debates. By training BERT on domain-specific content, we aim to generate embeddings that capture the nuanced language, rhetoric, and argumentation style unique to political discourse. The project investigates whether these enhanced embeddings improve performance on downstream tasks related to political debates, such as sentiment analysis, stance detection, argument classification, and relation classification.

Objectives:

  1. Continuous Pre-Training:
    We pretrain BERT on political debate transcripts to generate embeddings that reflect the intricate structure and linguistic patterns in political dialogue.
  2. Evaluation on Downstream Tasks:
    The effectiveness of these embeddings will be assessed across a variety of downstream tasks, with a focus on tasks relevant to the political domain.
  3. Analysis:
    By comparing the performance of RooseBERT (our pretrained BERT model) against general-domain BERT and comparable competitor models, we aim to demonstrate the effectiveness of our model in this domain.

2️⃣ Datasets

The following datasets were used for pre-training:

3️⃣ Models

This project continues the pretraining of the following BERT checkpoints:

  • bert-base-cased
  • bert-base-uncased

4️⃣ Installation

Conda Setup

# clone project
git clone https://github.com/deborahdore/RooseBERT
cd RooseBERT

# create conda environment and install dependencies
conda env create -f environment.yaml -n rooseBERT

# activate conda environment
conda activate rooseBERT

5️⃣ How to Run

🚀 Download the Corpora and Prepare the Dataset

Use the download_pretraining_data.sh script to download and prepare the datasets required for continued BERT pre-training. It calls prepare_training_dataset.py to create the train/dev split from the raw dataset.

💡 Hint: For optimal BERT pre-training, we use sequences of length 128 for 80% of the training steps and sequences of length 512 for the remaining 20%.

python script/prepare_training_dataset.py
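
For reference, a minimal sketch of the kind of preparation this step performs is shown below. It is only an illustration under assumptions: the raw transcripts live in a hypothetical data/raw/ folder, run_mlm.py reads CSV files with a single text column, and the same train/dev split is written once per sequence-length setting (matching the data/training/max_128/ and max_512/ paths used in the pretraining commands below). The actual script/prepare_training_dataset.py may differ.

# Hypothetical sketch of the train/dev preparation; the real
# script/prepare_training_dataset.py may differ in its details and paths.
import csv
import random
from pathlib import Path

RAW_DIR = Path("data/raw")                      # assumed location of raw transcripts
OUT_DIRS = ["data/training/max_128", "data/training/max_512"]
DEV_FRACTION = 0.05                             # assumed dev split size

def write_csv(path, rows):
    """Write a single 'text' column, which is what run_mlm.py reads."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerows([r] for r in rows)

# Collect non-empty lines from every transcript and shuffle them.
lines = []
for transcript in sorted(RAW_DIR.glob("*.txt")):
    with transcript.open(encoding="utf-8") as f:
        lines.extend(l.strip() for l in f if l.strip())
random.seed(42)
random.shuffle(lines)

# The same split is written for both sequence-length settings; the 128- vs
# 512-token truncation itself happens later, inside run_mlm.py.
split = int(len(lines) * (1 - DEV_FRACTION))
for out_dir in OUT_DIRS:
    write_csv(Path(out_dir) / "train.csv", lines[:split])
    write_csv(Path(out_dir) / "dev.csv", lines[split:])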

🚀 Running Continuous Pretraining for Masked Language Modeling

To continue pretraining a model with Masked Language Modeling (MLM), use the run_mlm.py script, adapted from the Hugging Face example. The pretraining process consists of two phases:

  1. First phase: Training for 120k steps with a maximum sequence length of 128.
  2. Second phase: Extending the maximum sequence length to 512 and continuing training up to 150k total steps (30k additional steps).

Below is the recommended configuration, though you can modify parameters as needed. A ready-to-run script is provided in the repository.

Phase 1: Training with Sequence Length 128

python -m torch.distributed.launch --nproc_per_node=8 \
        --master_addr=123 \
        src/run_mlm.py \
        --model_name_or_path "bert-base-cased" \
        --cache_dir "cache/bert-base-cased-batch2048-lr5e-4/" \
        --train_file "data/training/max_128/train.csv" \
        --validation_file "data/training/max_128/dev.csv" \
        --max_seq_length 128 \
        --preprocessing_num_workers 4 \
        --output_dir "logs/bert-base-cased-batch2048-lr5e-4/" \
        --do_train \
        --do_eval \
        --eval_strategy "steps" \
        --per_device_train_batch_size 64 \
        --per_device_eval_batch_size 64 \
        --gradient_accumulation_steps 4 \
        --learning_rate 5e-4 \
        --weight_decay 0.01 \
        --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-6 \
        --max_steps 120000 \
        --warmup_steps=10000 \
        --logging_dir "logs/bert-base-cased-batch2048-lr5e-4/" \
        --logging_strategy "steps" \
        --logging_steps 500 \
        --save_strategy "steps" \
        --save_steps 20000 \
        --save_total_limit 3 \
        --seed 42 \
        --data_seed 42 \
        --fp16 \
        --local_rank 0 \
        --eval_steps 1000 \
        --dataloader_num_workers 8 \
        --run_name "bert-base-cased-batch2048-lr5e-4" \
        --deepspeed "configs/deepspeed_config.json" \
        --report_to "wandb" \
        --eval_on_start \
        --log_level "detail"
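
Note that with these settings the effective batch size is 64 sequences per device × 8 GPUs × 4 gradient-accumulation steps = 2048 sequences, which matches the batch2048 tag in the run and directory names.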

Phase 2: Training with Sequence Length 512

python -m torch.distributed.launch --nproc_per_node=8 \
        --master_addr=123 \
        src/run_mlm.py \
        --model_name_or_path "logs/bert-base-cased-batch2048-lr5e-4/checkpoint-120000" \
        --overwrite_output_dir  \
        --resume_from_checkpoint "logs/bert-base-cased-batch2048-lr5e-4/checkpoint-120000" \
        --cache_dir "cache/bert-base-cased-batch2048-lr5e-4/" \
        --train_file "data/training/max_512/train.csv" \
        --validation_file "data/training/max_512/dev.csv" \
        --max_seq_length 512 \
        --preprocessing_num_workers 4 \
        --output_dir "logs/bert-base-cased-batch2048-lr5e-4/" \
        --do_train \
        --do_eval \
        --eval_strategy "steps" \
        --per_device_train_batch_size 64 \
        --per_device_eval_batch_size 64 \
        --gradient_accumulation_steps 4 \
        --learning_rate 5e-4 \
        --weight_decay 0.01 \
        --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-6 \
        --max_steps 150000 \
        --logging_dir "logs/bert-base-cased-batch2048-lr5e-4/" \
        --logging_strategy "steps" \
        --logging_steps 500 \
        --save_strategy "steps" \
        --save_steps 20000 \
        --save_total_limit 3 \
        --seed 42 \
        --data_seed 42 \
        --fp16 \
        --local_rank 0 \
        --eval_steps 1000 \
        --dataloader_num_workers 8 \
        --run_name "bert-base-cased-batch2048-lr5e-4" \
        --deepspeed "configs/deepspeed_config.json" \
        --report_to "wandb" \
        --eval_on_start \
        --log_level "detail"

Notes

  • Training is accelerated with DeepSpeed (configs/deepspeed_config.json), FP16 mixed precision, and gradient accumulation.
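
The contents of configs/deepspeed_config.json are not shown above. As a rough, hypothetical stand-in (the repository's actual file may differ), a minimal configuration compatible with the Hugging Face Trainer integration could be generated like this, where "auto" tells the Trainer to reuse the corresponding command-line values:

# Hypothetical stand-in for configs/deepspeed_config.json; the repository's
# actual configuration may differ (e.g. in the ZeRO stage or optimizer setup).
import json

deepspeed_config = {
    "train_micro_batch_size_per_gpu": "auto",   # filled from --per_device_train_batch_size
    "gradient_accumulation_steps": "auto",      # filled from --gradient_accumulation_steps
    "gradient_clipping": "auto",
    "fp16": {"enabled": "auto"},                # filled from --fp16
    "zero_optimization": {"stage": 1},          # assumed ZeRO stage
}

with open("configs/deepspeed_config.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)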

🚀 Downstream Tasks

We evaluated the models on a variety of downstream tasks relevant to natural language processing and political discourse analysis. Below is a summary of the tasks and their datasets:

  • Stance classification: a comparative study and use case on Australian parliamentary debates (binary classification)
    • Stance detection and cross-domain transferability on Australian Parliamentary Debates
  • Get out the vote: Determining support or opposition from Congressional floor-debate transcripts (binary classification)
    • Stance detection of US Congress Debates
  • 'Aye' or 'No'? Speech-level Sentiment Analysis of Hansard UK Parliamentary Debate Transcripts (binary classification)
    • Sentiment Analysis of UK Parliamentary Debates
  • ParlVote: A Corpus for Sentiment Analysis of Political Debates (sentence-pair, binary classification)
    • Sentiment Analysis of UK Parliamentary Debates using both motion and speech
  • Policy-focused Stance Detection in Parliamentary Debate Speeches (multi-class classification)
    • Policy preference classification of UK Parliamentary Debates
  • Argument-based detection and classification of fallacies in political debates
    • Two tasks: Argument Component Detection and Classification (sequence labelling) and Argument Component Relations Prediction and Classification (sentence-pair, multi-class classification) in US Presidential Debates
  • Policy Preference Detection in Parliamentary Debate Motions (multi-class classification)
    • Policy preference classification in UK Parliamentary speeches
  • From Debates to Diplomacy: Argument Mining Across Political Registers
    • Two tasks: Argument Component Detection and Classification (sequence labelling) and Argument Component Relations Prediction and Classification (sentence-pair, multi-class classification) in the UNSC.

To sum up:

By output format:

  Output format                Count
  binary classification        4
  multi-class classification   4
  sequence labelling           3

By input type:

  Input type        Count
  single sentence   5
  sentence-pair     3
  NER               3

By task:

  Task                                                         Count
  sentiment analysis                                           2
  stance detection                                             2
  policy preference classification                             2
  argument component detection and classification              2
  argument component relation prediction and classification    2
  NER                                                          1

To download all the necessary datasets, use the download_downstream_data.sh script. Then run the prepare_downstream_data.py script to process them; it handles everything automatically.

./download_downstream_data.sh

python script/prepare_downstream_data.py
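
The repository's own fine-tuning scripts for the downstream tasks are not reproduced in this README. As an illustration only, a minimal sketch of fine-tuning the pretrained checkpoint on one of the binary classification tasks could look like the following; the checkpoint path comes from the pretraining output directory above, while the data paths, column names, and hyperparameters are assumptions.

# Hypothetical sketch of fine-tuning the pretrained checkpoint on a binary
# classification task; the repository's own downstream scripts may differ.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

CHECKPOINT = "logs/bert-base-cased-batch2048-lr5e-4/"   # RooseBERT pretraining output
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# Assumed CSV layout ("text" and "label" columns) produced by
# script/prepare_downstream_data.py.
dataset = load_dataset("csv", data_files={
    "train": "data/downstream/stance/train.csv",
    "validation": "data/downstream/stance/dev.csv",
})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="logs/stance/roosebert",   # mirrors logs/task_name/model_name/
        per_device_train_batch_size=16,       # assumed hyperparameters
        learning_rate=2e-5,
        num_train_epochs=3,
        eval_strategy="epoch",
        seed=42,
    ),
    data_collator=DataCollatorWithPadding(tokenizer),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.evaluate()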

6️⃣ Extract Results

At the end of each run, the results are available in the RooseBERT/logs/task_name/model_name/ folder. The extract_results.py script automatically processes the results and saves them to a CSV file.

python extract_results.py
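
For reference, a rough sketch of what this aggregation could look like is below; it assumes each downstream run leaves a Hugging Face Trainer-style all_results.json under logs/<task_name>/<model_name>/, which may not match the repository's actual layout or the real extract_results.py.

# Hypothetical sketch of collecting per-run metrics into one CSV; the layout
# of the logs folder and the metrics file name are assumptions.
import csv
import json
from pathlib import Path

rows = []
for results_file in Path("logs").glob("*/*/all_results.json"):
    task, model = results_file.parts[1], results_file.parts[2]
    metrics = json.loads(results_file.read_text())
    rows.append({"task": task, "model": model, **metrics})

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=sorted({k for r in rows for k in r}))
    writer.writeheader()
    writer.writerows(rows)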

If you have run the model multiple times with different seeds, use the compute_stats.py script to compute the mean and standard deviation across runs.

python compute_stats.py
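
Again as a hypothetical sketch (the real compute_stats.py may differ), assuming the aggregated CSV contains one row per run with task and model columns plus numeric metric columns, the per-task mean and standard deviation could be computed as follows:

# Hypothetical sketch of computing mean/std over runs with different seeds;
# column names and file layout are assumptions.
import pandas as pd

results = pd.read_csv("results.csv")
metric_cols = results.select_dtypes("number").columns
stats = results.groupby(["task", "model"])[list(metric_cols)].agg(["mean", "std"])
stats.to_csv("results_stats.csv")
print(stats)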
