- Open-source system-level translation framework
- Provides fluent and natural translations utilizing LLMs
- Ensures privacy and security with local translation processes
- Capable of zero-shot in-task translations
- Utilizes QLoRA fine-tuned models for enhanced accuracy
- Employs both general and in-task specific translation memories and glossaries
- Incorporates preceding text in document-level translations for improved context understanding
- Combining QLoRA with in-task translation memory and glossary resulted in a ~45% increase in aggregated WMT23 translation scores, benchmarked against the Mistral 7B Instruct model
- Demonstrated high recall for valid translation memories and glossaries, including previous translations and character names
- Surpassed the performance of the native TowerInstruct model in three (Ja<->En, Zh->En) of the four WMT23 language directions tested
- Outperformed DeepL in translating the Japanese web novel "That Time I Got Reincarnated as a Slime" into Chinese using in-task RAG
- Japanese to Chinese translation improvements:
  - +29% sacrebleu
  - +0.4% comet22
See the write-up for more details
Simply run:
pip install t-ragx
or if you are feeling lucky:
pip install git+https://github.com/rayliuca/T-Ragx.git
See the wiki page for instructions
Note: you can access preview read-only T-Ragx Elasticsearch services at https://t-ragx-fossil.rayliu.ca and https://t-ragx-fossil2.rayliu.ca
(But you will need a personal Elasticsearch service to add your in-task memories)
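For example, a minimal sketch of adding an in-task memory entry to your own Elasticsearch service using the official Python client (the index name and document fields below are illustrative assumptions; match them to the schema your input processor expects):

```python
# Index an in-task translation-memory entry into a personal Elasticsearch
# service. The index name and field names are hypothetical placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # your own service, not the read-only demos

es.index(
    index="my_task_memory",  # hypothetical index name
    document={
        "source_text": "転生したらスライムだった件",
        "target_text": "That Time I Got Reincarnated as a Slime",
        "source_lang": "ja",
        "target_lang": "en",
    },
)
```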
Download the conda environment.yml file and run:
conda env create -f environment.yml
## or with mamba
# mamba env create -f environment.yml
This will create a t_ragx environment that's compatible with this project.
Download the requirements.txt file, then in your favourite virtual environment run:
pip install -r requirements.txt
Initialize the input processor:
import t_ragx

# Initialize the input processor, which will retrieve the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()
# Load/ point to the demo resources
input_processor.load_general_glossary("https://t-ragx-public.s3.us-west-004.backblazeb2.com/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host=["https://t-ragx-fossil.rayliu.ca", "https://t-ragx-fossil2.rayliu.ca"])

Using the llama-cpp-python backend:
import t_ragx
# T-Ragx currently supports:
# Huggingface transformers: MistralModel, InternLM2Model
# Ollama API: OllamaModel
# OpenAI API: OpenAIModel
# Llama-cpp-python backend: LlamaCppPythonModel
mistral_model = t_ragx.models.LlamaCppPythonModel(
repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
filename="*Q4_K_M*",
# see https://huggingface.co/rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2
# for other files
chat_format="mistral-instruct",
model_config={'n_ctx':2048}, # increase the context window
)
t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)

Translate!
t_ragx_translator.batch_translate(
source_text_list, # the input text list to translate
pre_text_list=pre_text_list, # optional, the preceding context for document-level translation
# Can generate via:
# pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)
source_lang_code='ja',
target_lang_code='en',
memory_search_args={'top_k': 3} # optional, pass additional arguments to input_processor.search_memory
)

Note: you can use any LLM through the API models (i.e. OllamaModel or OpenAIModel) or by extending the t_ragx.models.BaseModel class
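For example, a rough sketch of a custom backend (the overridden method name and signature here are assumptions for illustration; check the t_ragx.models.BaseModel source for the actual abstract interface):

```python
import t_ragx

def my_llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for your own LLM call (local model, HTTP API, etc.)."""
    return "..."  # return the generated translation here

class MyCustomModel(t_ragx.models.BaseModel):
    # NOTE: `translate` is an assumed method name for this sketch;
    # mirror whatever methods BaseModel actually declares.
    def translate(self, prompt, **kwargs):
        return my_llm_generate(prompt)
```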
The following models were fine-tuned using the T-Ragx prompts, so they might work a bit better with T-Ragx than some off-the-shelf models
| Source Model | Model Type | Quantization | Fine-tuned Model |
|---|---|---|---|
| mistralai/Mistral-7B-Instruct-v0.2 | LoRA | | rayliuca/TRagx-Mistral-7B-Instruct-v0.2 |
| | merged AWQ | AWQ | rayliuca/TRagx-AWQ-Mistral-7B-Instruct-v0.2 |
| | merged GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2 |
| mlabonne/NeuralOmniBeagle-7B | LoRA | | rayliuca/TRagx-NeuralOmniBeagle-7B |
| | merged AWQ | AWQ | rayliuca/TRagx-AWQ-NeuralOmniBeagle-7B |
| | merged GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-NeuralOmniBeagle-7B |
| internlm/internlm2-7b | LoRA | | rayliuca/TRagx-internlm2-7b |
| | merged GPTQ | GPTQ | rayliuca/TRagx-GPTQ-internlm2-7b |
| Unbabel/TowerInstruct-7B-v0.2 | LoRA | | rayliuca/TRagx-TowerInstruct-7B-v0.2 |
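The LoRA adapters should load on top of their base models with the usual transformers + peft flow; a sketch, assuming the adapter repos follow the standard PEFT layout:

```python
# Load the Mistral LoRA adapter on top of its base model.
# Adjust device/dtype settings for your hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "rayliuca/TRagx-Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
```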
All of the datasets used in this project:
| Dataset | Translation Memory | Glossary | Training | Testing | License |
|---|---|---|---|---|---|
| OpenMantra | ✅ | | ✅ | | CC BY-NC 4.0 |
| WMT < 2023 | ✅ | | ✅ | | for research |
| ParaMed | ✅ | | ✅ | | cc-by-4.0 |
| ted_talks_iwslt | ✅ | | ✅ | | cc-by-nc-nd-4.0 |
| JESC | ✅ | | ✅ | | CC BY-SA 4.0 |
| MTNT | | | ✅ | | Custom/ Reddit API |
| WCC-JC | ✅ | | ✅ | | for research |
| ASPEC | | | ✅ | | custom, for research |
| All other ja-en/zh-en OPUS data | ✅ | | | | mix of open licenses: check https://opus.nlpl.eu/ |
| Wikidata | | ✅ | | | CC0 |
| Tensei Shitara Slime Datta Ken Wiki | | ☑️ in task | | | CC BY-SA |
| WMT 2023 | | | | ✅ | for research |
| Tensei Shitara Slime Datta Ken Web Novel & web translations | ☑️ in task | | | ✅ | Not used for training or redistribution |