Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how well these models perform on topics relevant to other cultures, whose presence on the web is less prominent. To address this gap, we introduce BertaQA, a multiple-choice trivia dataset that is parallel in English and Basque. The dataset consists of a local subset with questions pertinent to the Basque culture, and a global subset with questions of broader interest. We find that state-of-the-art LLMs struggle with local cultural knowledge, even as they excel on global topics. However, we show that continued pre-training in Basque significantly improves the models' performance on Basque culture, even when queried in English. To our knowledge, this is the first solid evidence of knowledge transfer from a low-resource to a high-resource language. Our analysis sheds light on the complex interplay between language and knowledge, and reveals that some prior findings do not fully hold when reassessed on local topics. Our dataset and evaluation code are available under open licenses at https://github.com/juletx/BertaQA.
Dataset: https://huggingface.co/datasets/HiTZ/BertaQA
Paper: https://arxiv.org/abs/2406.07302
To get examples and statistics of the dataset, run the `examples_statistics.ipynb` Jupyter notebook in the `analysis` directory.
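If you just want a quick look at the data without the notebook, a minimal sketch along these lines should work. The config name `en`, the split name `test`, and the `category` field are assumptions; check the dataset card for the actual schema:

```python
from collections import Counter

from datasets import load_dataset

# Load the English config of BertaQA from the HuggingFace Hub.
# The config name "en", the split name "test", and the "category"
# field are assumptions; check the dataset card for the real schema.
dataset = load_dataset("HiTZ/BertaQA", "en", split="test")

print(dataset[0])  # inspect one example

# Count questions per category to get a quick overview of the data.
print(Counter(example["category"] for example in dataset))
```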
To evaluate models, you will need to install LM Evaluation Harness. Clone the repository and install the requirements:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```

To run evaluation on open models, use the scripts in the `scripts` directory. Each script evaluates a model on all the tasks. For example, to run evaluation on Latxa v1.1 7b, run:
```bash
sbatch lm_eval_latxa-7b-v1.1.slurm
```

Evaluation results are stored in the `results` directory: each model has a directory with a JSON file per task containing the model's average scores.
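If you are not on a SLURM cluster, the same evaluation can be sketched directly through the harness's Python API. This is a minimal sketch: the task name `bertaqa_en` and the 5-shot setting are assumptions, so use the task names and settings defined in this repository's evaluation configs:

```python
import lm_eval

# Evaluate a HuggingFace model on a BertaQA task with the harness's
# Python API. The task name "bertaqa_en" and num_fewshot=5 are
# assumptions; match them to this repository's task configs.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=HiTZ/latxa-7b-v1.1",
    tasks=["bertaqa_en"],
    num_fewshot=5,
)
print(results["results"])
```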
To analyze the results, run the `bertaqa.ipynb` Jupyter notebook in the `analysis` directory. This notebook generates the tables in the paper.
Commercial models from OpenAI and Anthropic are evaluated using their respective APIs. The evaluation scripts are in the `openai` and `anthropic` directories, and the evaluation results are in the `results` directory.
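As an illustration of how a single multiple-choice question might be posed through one of these APIs, here is a minimal sketch using the OpenAI client; both the example question and the prompt format are illustrative, not necessarily those used in the paper:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Pose one multiple-choice question. Both the example question and
# the prompt format are illustrative, not taken from the paper.
question = "Which city is the capital of the province of Biscay?"
candidates = ["Bilbao", "Vitoria-Gasteiz", "Donostia", "Pamplona"]
prompt = (
    question
    + "\n"
    + "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", candidates))
    + "\nAnswer with a single letter."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1,
)
print(response.choices[0].message.content)  # e.g. "A"
```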
To run evaluation on OpenAI models, use the scripts in the `openai` directory. There is a Python script to evaluate each dataset, and a bash script for each model and dataset. For example, to run evaluation on GPT-3.5 Turbo on EusTrivia, run:
```bash
bash gpt-3.5-turbo-0125_eus_trivia.sh
```

Evaluation results are in the `results` directory. Each model has a directory with the results of the evaluation on each task. In this case, all model outputs are saved for each task, and scores can be calculated using the `correct` field. For EusTrivia and EusExams, there are additional scripts to obtain detailed results by category.
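As a sketch of how scores could be computed from those saved outputs, assuming each results file is a JSON list of records with a boolean `correct` field (the path below is hypothetical; adapt it to the actual layout):

```python
import json

# Compute accuracy from saved model outputs, assuming each results
# file is a JSON list of records with a boolean "correct" field.
# The path below is hypothetical; adapt it to the actual layout.
with open("results/gpt-3.5-turbo-0125/eus_trivia.json") as f:
    outputs = json.load(f)

accuracy = sum(record["correct"] for record in outputs) / len(outputs)
print(f"Accuracy: {accuracy:.2%}")
```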
To analyze the results, run the `bertaqa_openai.ipynb` and `bertaqa_anthropic.ipynb` Jupyter notebooks in the `analysis` directory. These notebooks generate the tables in the paper.
We use the HuggingFace Transformers library to translate the datasets. Translation scripts are in the `translate` directory, with a folder for each model containing the scripts used to generate the results in the paper. The resulting translated datasets are available on HuggingFace: https://huggingface.co/HiTZ/BertaQA.
- The `dataset.py` script contains the dataset classes.
- The `dataset_configs.py` script contains the dataset configurations.
- The `translate_dataset_nllb.py` script translates the datasets with NLLB. It uses the `translate.py` script to translate each field of the dataset (see the sketch after this list).
- The `translate_dataset_few_shot.py` script translates the datasets with XGLM. It uses the `translate_few_shot.py` script to translate each field of the dataset.
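As a rough illustration of the core Transformers call behind the NLLB scripts, here is a minimal sketch for a single sentence; the actual scripts handle batching and every dataset field. It uses the public `facebook/nllb-200-3.3B` checkpoint and NLLB's Basque (`eus_Latn`) and English (`eng_Latn`) language codes:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Translate a single Basque sentence to English with NLLB-200.
# This sketches what translate_dataset_nllb.py does; the actual
# scripts iterate over every field of every dataset example.
model_name = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eus_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Nor izan zen lehen lehendakaria?", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force NLLB to decode into English ("eng_Latn").
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```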
For example, to translate the BertaQA dataset to English using NLLB-200-3.3B, run:

```bash
sbatch translate_bertaqa_nllb.slurm
```

To self-translate the BertaQA dataset using Latxa v1.1 7b, run:
```bash
sbatch translate_bertaqa_latxa-7b-v1.1.slurm
```

If you use BertaQA, please cite our paper:

```bibtex
@inproceedings{NEURIPS2024_3bb42f6b,
author = {Etxaniz, Julen and Azkune, Gorka and Soroa, Aitor and de Lacalle, Oier Lopez and Artetxe, Mikel},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages = {34077--34097},
publisher = {Curran Associates, Inc.},
title = {BertaQA: How Much Do Language Models Know About Local Culture?},
url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/3bb42f6bb1b1ab6809afd6c90865b087-Paper-Datasets_and_Benchmarks_Track.pdf},
volume = {37},
year = {2024}
}
```