An experimental open-source Large Language Model for the Manx Gaelic (Gaelg) language

Jade-RM/manx-gaelic-llm

title: manx-gaelic-llm
emoji: 🚀
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: tiny_llm_with_gradio_interface.py
pinned: false

manx-gaelic-llm

An open-source (tiny) Large Language Model for the Manx (Gaelg) language

This project aims to support and promote the revitalisation of the Manx language and to help people learn it. It is intended as a text generation tool and, as the project progresses, as a partner for very simple conversations, so that learners at beginner or advanced-beginner level can practise what they have learned in lessons.

After searching, I found no comparable tool for Manx to date, so this model could fill a gap.

Status: The project contains a tiny LLM with a BPE tokenizer, trained on a small corpus of ~1400 sentences with a vocabulary of ~400. A Gradio interface has been added to the code so that users can chat with the LLM.

The LLM gives relevant responses most of the time when asked simple questions or presented with simple statements. It sometimes answers creatively (with sentences not lifted directly from the corpus), and it has asked simple questions of its own a couple of times during conversations. Earlier versions worked as a sentence generator; the model can no longer complete sentences, but it now functions as a simple conversation partner. It can discuss topics such as likes and dislikes, feelings (e.g., ta mee skee, ta mee maynrey...), simple daily activities (e.g., ta mee roie, ta mee gobbragh...), simple activities in the past (e.g., ren mee gobbragh jea), and basic information such as where one lives and what pets one has.

It is currently a research project and not ready for public use, but with expanded data it could soon become useful as a tool for beginner conversation practice. This repo contains all the code and data needed to train the model offline, as well as the files needed to deploy the model on Hugging Face, where a Space has also been created for the demo model. A full, open-source pipeline is planned.
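For readers unfamiliar with BPE, the following is a minimal, illustrative sketch of how byte-pair-encoding merges can be learned from a handful of sentences. It is not the tokenizer code used in this repo: the function names (`learn_bpe_merges`, `merge_pair`, `encode_word`), the `</w>` end-of-word marker and the tie-breaking rule are all assumptions made for the example, and the three sample sentences are simply the phrases quoted above.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one joined symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(sentences, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split, lower-cased words."""
    # Each word starts as a tuple of characters plus an end-of-word marker (assumed: "</w>").
    words = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            words[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word before counting again.
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges


def encode_word(word, merges):
    """Split a word into subword symbols by replaying the learned merges in order."""
    symbols = tuple(word.lower()) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    sample = ["ta mee skee", "ta mee maynrey", "ta mee roie"]
    merges = learn_bpe_merges(sample, num_merges=10)
    print(merges)
    print(encode_word("mee", merges))
```

With a corpus this small, frequent words quickly collapse into single tokens, which is one reason a modest number of merges can be enough for a tiny model.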

Sources for the corpus: The corpus is based on my own learning notes and inspired by the textbook Loayr Gaelg! Keim Nane and the beginner lessons on the website learnmanx.com. At this stage the corpus is experimental. It only includes vocabulary and grammar structures taught at Level 1 (Keim Nane) and is intended as an aid for learners at this level. As the corpus grows, I plan to expand it to include language taught at Level 2 (Keim Jees). All data has been created by me, and any sentences inspired by the above sources have been rephrased or paraphrased. However, I hope for help from and collaboration with more fluent Manx speakers as the project grows. I have written and curated the corpus myself so that it contains the specific vocabulary and grammar structures that students learn at each level, and so that it is not contaminated with words or phrases from other languages (unless commonly used in Manx). The LLM is intended to be monolingual.
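One way to keep a hand-written corpus within a fixed level vocabulary is to flag any word that is not on an approved list before a sentence is added. The sketch below is only an illustration of that idea and is not taken from this repo: the function name `out_of_vocabulary`, the word-splitting rule and the tiny `allowed` set are all hypothetical.

```python
import re

# Assumed word pattern: runs of letters, optionally joined by an apostrophe or hyphen.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def out_of_vocabulary(sentence, allowed_words):
    """Return the lower-cased words in `sentence` that are not in `allowed_words`."""
    tokens = WORD_PATTERN.findall(sentence.lower())
    return [token for token in tokens if token not in allowed_words]


if __name__ == "__main__":
    # Hypothetical mini word list built from the example phrases in this README.
    allowed = {"ta", "mee", "skee", "maynrey", "roie", "gobbragh", "ren", "jea"}
    for candidate in ["Ta mee skee.", "Ren mee gobbragh jea.", "Ta mee happy."]:
        print(candidate, out_of_vocabulary(candidate, allowed))
```

A check like this only catches unknown word forms; it says nothing about grammar, so it would complement manual review rather than replace it.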

Further development: The most important task at this stage is to expand the corpus at Keim Nane level so that the LLM becomes more conversational, more creative and more accurate. As the project progresses, my aim is to release small demos and seek collaboration with members of the Manx-speaking community. In addition to this, I am focusing on steps to expand and improve the model, tokenizer and final application. The model is functional, so for now it is more important to prioritise developing and expanding the corpus, but I am aware that much can be done to improve the model as the project progresses.

As well as my own learning notes, I used the following language resources as inspiration and as aids to help me create a corpus:

Here is a list of some resources and aids which have been useful in helping me with ideas on how to best build the transformer (I built this transformer from scratch and am continually looking at ways to improve it):
