This repository implements and evaluates multiple fairness frameworks applied to NLP text classification under adversarial settings. The implementations include:
- FRAPPÉ on NLP Text Classification: A framework that applies post-processing fairness mitigation.
- BadFair on NLP Text Classification: An approach to attack fairness mechanisms via data poisoning.
- BadFair on FRAPPÉ: Integration and cross-evaluation of the BadFair attack within the FRAPPÉ framework.
Python package requirements are listed in requirements.txt and must be installed before running any script (for example, with pip install -r requirements.txt).
There are four folders: two for each dataset, one with the BadFair attack and one without.
Since implementing BadFair requires both training data poisoning and test data manipulation, the folders are kept separate to prevent accidental cross-contamination.
The logits are saved in the contents folder, which needs to be emptied every time parameters are changed and the model is retrained.
In each folder, run train_model.py first and then evaluate.py.
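Emptying the contents folder by hand is easy to forget, so a small helper can do it before each retraining run. The following is a minimal sketch, not part of the repository: the function name clear_folder is hypothetical, and it assumes the contents folder sits inside the experiment folder you are working in.

```python
from pathlib import Path
import shutil


def clear_folder(folder: Path) -> int:
    """Delete everything inside `folder` but keep the folder itself.

    Returns the number of top-level entries removed (0 if the folder
    does not exist). Hypothetical helper, not part of the repository.
    """
    if not folder.is_dir():
        return 0
    removed = 0
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subdirectory and all its files
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed
```

Calling clear_folder(Path("contents")) from inside one of the experiment folders before rerunning train_model.py would remove stale logits; adjust the path if the folder lives elsewhere.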
- train_model.py downloads the data (and poisons it), trains the model, and applies the FRAPPÉ framework.
- evaluate.py calculates fairness metrics and calls postproc_fairness/fairmain.py.
- postproc_fairness/fairmain.py applies FRAPPÉ and calculates fairness metrics.
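The run order described above can be sketched as a small driver script. This is an illustrative assumption rather than something shipped with the repository: the function name run_pipeline is hypothetical, and the folder path passed to it must be replaced with one of the four actual experiment folders.

```python
import subprocess
import sys
from pathlib import Path


def run_pipeline(folder: Path, scripts=("train_model.py", "evaluate.py")) -> None:
    """Run the scripts in order from inside `folder`.

    check=True makes a failing script raise CalledProcessError, so
    evaluate.py never runs on top of a failed training step.
    Hypothetical helper, not part of the repository.
    """
    for script in scripts:
        # sys.executable reuses the interpreter (and installed packages)
        # that is running this driver.
        subprocess.run([sys.executable, script], cwd=folder, check=True)
```

Running each folder through its own call keeps the attacked and clean experiments isolated, which matches the reason the folders are separated in the first place.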