Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language
Authors: Jesus Alvarez C, Daua Karajeanes, Ashley Prado, John Ruttan, Ivory Yang, Sean O’Brien, Vasu Sharma, Kevin Zhu
Explore how we accelerate Comanche NLP by combining synthetic text pipelines and language ID to overcome data scarcity in endangered languages.
🔗 Read the full paper (AmericasNLP 2025)
git clone https://github.com/comanchegenerate/ComancheSynthetic.git
cd ComancheSynthetic
- Datasets/: 412-phrase Comanche-English corpus, the first for this language.
- comanche_synthetic_generation.py: Generate validated synthetic Comanche text via GPT-4 few-shot prompting.
- language_identification.ipynb: Language identification experiments showing that few-shot examples improve accuracy.
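
As a rough illustration of the few-shot idea behind the script and notebook listed above, here is a minimal sketch of how a language identification prompt could be assembled and scored. It is not the repository's code: `build_language_id_prompt`, `accuracy`, and the prompt wording are hypothetical names and choices made for this example, and the model call itself is left out (any function that maps a prompt to a label can be plugged in).

```python
from typing import Callable, Sequence

# Hypothetical type for this sketch: (text, language_label) pairs.
Example = tuple[str, str]


def build_language_id_prompt(examples: Sequence[Example], text: str) -> str:
    """Assemble a few-shot prompt asking a model to name the language of `text`."""
    blocks = ["Identify the language of each text."]
    for sample, label in examples:
        blocks.append(f"Text: {sample}\nLanguage: {label}")
    # The final entry is left unlabeled for the model to complete.
    blocks.append(f"Text: {text}\nLanguage:")
    return "\n\n".join(blocks)


def accuracy(predict: Callable[[str], str], labeled: Sequence[Example]) -> float:
    """Fraction of labeled texts for which `predict` returns the gold label."""
    if not labeled:
        return 0.0
    correct = sum(
        predict(text).strip().lower() == label.lower() for text, label in labeled
    )
    return correct / len(labeled)
```

To compare zero-shot and few-shot settings under this sketch, one would call `build_language_id_prompt` once with an empty example list and once with a handful of pairs drawn from the corpus in Datasets/, send each prompt to the model, and pass the resulting predictor to `accuracy` on a held-out set.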
Feedback and pull requests welcome!