
🦖 T-Ragx

T-Ragx Featured Image

Enhancing Translation with RAG-Powered Large Language Models


T-Ragx Demo: Open In Colab

TL;DR

Overview

  • Open-source system-level translation framework
  • Provides fluent and natural translations utilizing LLMs
  • Ensures privacy and security with local translation processes
  • Capable of zero-shot in-task translations

Methods

  • Utilizes QLoRA fine-tuned models for enhanced accuracy
  • Employs both general and in-task specific translation memories and glossaries
  • Incorporates preceding text in document-level translations for improved context understanding
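The three ingredients above can be pictured with a simplified sketch of prompt assembly (illustrative only; this is not the actual T-Ragx prompt format, and the function name is made up for this example):

```python
# Illustrative sketch only: shows how retrieved translation memories,
# glossary hits, and preceding text can be combined into one LLM prompt.
def build_prompt(source_text, memories=(), glossary=None, preceding=()):
    lines = []
    if memories:
        lines.append("Relevant translations:")
        lines += [f"- {src} => {tgt}" for src, tgt in memories]
    if glossary:
        lines.append("Glossary:")
        lines += [f"- {term}: {translation}" for term, translation in glossary.items()]
    if preceding:
        # Document-level context: the sentences immediately before this one.
        lines.append("Preceding text: " + " ".join(preceding))
    lines.append(f"Translate: {source_text}")
    return "\n".join(lines)


prompt = build_prompt(
    "リムルは旅を続けた。",
    memories=[("スライム", "slime")],
    glossary={"リムル": "Rimuru"},
    preceding=["彼は森を出た。"],
)
```

The LLM then generates the target-language text conditioned on all of this retrieved context rather than on the source sentence alone.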

Results

  • Combining QLoRA with in-task translation memory and glossary resulted in a ~45% increase in aggregated WMT23 translation scores, benchmarked against the Mistral 7B Instruct model
  • Demonstrated high recall for valid translation memories and glossaries, including previous translations and character names
  • Surpassed the performance of the native TowerInstruct model in three (Ja<->En, Zh->En) of the four WMT23 language directions tested
  • Outperformed DeepL in translating the Japanese web novel "That Time I Got Reincarnated as a Slime" into Chinese using in-task RAG
    • Japanese to Chinese translation improvements:
      • +29% sacrebleu
      • +0.4% comet22

👉 See the write-up for more details 📜

Getting Started

Install

Simply run:

pip install t-ragx

or if you are feeling lucky:

pip install git+https://github.com/rayliuca/T-Ragx.git

Elasticsearch

See the wiki page instructions

Note: you can access preview, read-only T-Ragx Elasticsearch services at https://t-ragx-fossil.rayliu.ca and https://t-ragx-fossil2.rayliu.ca (but you will need your own Elasticsearch service to add in-task memories)

Environment

(Recommended) Conda / Mamba

Download the conda environment.yml file and run:

conda env create -f environment.yml

## or with mamba
# mamba env create -f environment.yml

This will create a t_ragx environment that is compatible with this project.

pip

Download the requirment.txt file, then in your favourite virtual environment run:

pip install -r requirment.txt

Examples

Initiate the input processor:

import t_ragx

# Initiate the input processor which will retrieve the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()

# Load/ point to the demo resources
input_processor.load_general_glossary("https://l8u0.c18.e2-1.dev/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host=["https://t-ragx-fossil.rayliu.ca", "https://t-ragx-fossil2.rayliu.ca"])

Using the llama-cpp-python backend:

import t_ragx

# T-Ragx currently supports:
# Huggingface transformers: MistralModel, InternLM2Model
# Ollama API: OllamaModel
# OpenAI API: OpenAIModel
# Llama-cpp-python backend: LlamaCppPythonModel
mistral_model = t_ragx.models.LlamaCppPythonModel(
    repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
    filename="*Q4_K_M*",
    # see https://huggingface.co/rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2
    # for other files
    chat_format="mistral-instruct",
    model_config={'n_ctx':2048}, # increase the context window
)

t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)

Translate!

t_ragx_translator.batch_translate(
    source_text_list,  # the input text list to translate
    pre_text_list=pre_text_list,  # optional, include the preceding context for document-level translation
    # Can generate via:
    # pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)
    source_lang_code='ja',
    target_lang_code='en',
    memory_search_args={'top_k': 3}  # optional, pass additional arguments to input_processor.search_memory
)
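The optional pre_text_list holds one preceding-context entry per source sentence. Below is a minimal sketch of the idea behind t_ragx.utils.helper.get_preceding_text (a hypothetical re-implementation for illustration; the real helper may format the context differently):

```python
# Hypothetical re-implementation for illustration; not the library code.
def get_preceding(source_text_list, max_sent=3):
    # For sentence i, collect up to max_sent sentences that precede it.
    return [
        source_text_list[max(0, i - max_sent):i]
        for i in range(len(source_text_list))
    ]


pre = get_preceding(["a.", "b.", "c.", "d."], max_sent=2)
# pre[0] == [] and pre[3] == ["b.", "c."]
```

Each entry gives the model a sliding window of document context, which is what enables the document-level translations described above.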

Models

Note: you can use any LLM through the API models (i.e. OllamaModel or OpenAIModel) or by extending the t_ragx.models.BaseModel class
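For backends that are not built in, BaseModel is the extension point. The sketch below is hypothetical: the generate method name and the local stub base class are assumptions for illustration only; check the actual t_ragx.models.BaseModel source for the real interface.

```python
# Hypothetical sketch; 'BaseModel' here is a local stand-in for
# t_ragx.models.BaseModel, and the 'generate' method name is an assumption.
class BaseModel:
    def generate(self, prompt: str) -> str:
        raise NotImplementedError


class CallableModel(BaseModel):
    """Wraps any callable that maps a prompt string to a completion."""

    def __init__(self, backend):
        self.backend = backend

    def generate(self, prompt: str) -> str:
        # Delegate generation to the wrapped callable.
        return self.backend(prompt)


model = CallableModel(lambda p: p.upper())
result = model.generate("translate this")
# result == "TRANSLATE THIS"
```

The same wrapper pattern would let you plug in any local or remote model behind a single method call.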

The following models were finetuned using the T-Ragx prompts, so they might work a bit better than off-the-shelf models with T-Ragx.

QLoRA Models:

| Source Model | Model Type | Quantization | Fine-tuned Model |
| --- | --- | --- | --- |
| mistralai/Mistral-7B-Instruct-v0.2 | LoRA | | rayliuca/TRagx-Mistral-7B-Instruct-v0.2 |
| merged | AWQ | AWQ | rayliuca/TRagx-AWQ-Mistral-7B-Instruct-v0.2 |
| merged | GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2 |
| mlabonne/NeuralOmniBeagle-7B | LoRA | | rayliuca/TRagx-NeuralOmniBeagle-7B |
| merged | AWQ | AWQ | rayliuca/TRagx-AWQ-NeuralOmniBeagle-7B |
| merged | GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-NeuralOmniBeagle-7B |
| internlm/internlm2-7b | LoRA | | rayliuca/TRagx-internlm2-7b |
| merged | GPTQ | GPTQ | rayliuca/TRagx-GPTQ-internlm2-7b |
| Unbabel/TowerInstruct-7B-v0.2 | LoRA | | rayliuca/TRagx-TowerInstruct-7B-v0.2 |

Data Sources

All of the datasets used in this project:

Dataset Translation Memory Glossary Training Testing License
OpenMantra ✅ ✅ CC BY-NC 4.0
WMT < 2023 ✅ ✅ for research
ParaMed ✅ ✅ cc-by-4.0
ted_talks_iwslt ✅ ✅ cc-by-nc-nd-4.0
JESC ✅ ✅ CC BY-SA 4.0
MTNT ✅ Custom/ Reddit API
WCC-JC ✅ ✅ for research
ASPEC ✅ custom, for research
All other ja-en/zh-en OPUS data ✅ mix of open licenses: check https://opus.nlpl.eu/
Wikidata ✅ CC0
Tensei Shitara Slime Datta Ken Wiki ☑️ in task CC BY-SA
WMT 2023 ✅ for research
Tensei Shitara Slime Datta Ken Web Novel & web translations ☑️ in task ✅ Not used for training or redistribution

t-ragx's People

Contributors

rayliuca, snyk-bot

t-ragx's Issues

ValueError: Either 'hosts' or 'cloud_id' must be specified

I ran the code in readme.md

import t_ragx

# Initiate the input processor which will retrieve the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()

# Load/ point to the demo resources
input_processor.load_general_glossary("https://l8u0.c18.e2-1.dev/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host=["https://t-ragx-fossil.rayliu.ca", "https://t-ragx-fossil2.rayliu.ca"])

The error is below:

File ~/autodl-tmp/minicoda3/envs/t_ragx/lib/python3.10/site-packages/elasticsearch/_sync/client/__init__.py:196, in Elasticsearch.__init__(self, hosts, cloud_id, api_key, basic_auth, bearer_auth, opaque_id, headers, connections_per_node, http_compress, verify_certs, ca_certs, client_cert, client_key, ssl_assert_hostname, ssl_assert_fingerprint, ssl_version, ssl_context, ssl_show_warn, transport_class, request_timeout, node_class, node_pool_class, randomize_nodes_in_pool, node_selector_class, dead_node_backoff_factor, max_dead_node_backoff, serializer, serializers, default_mimetype, max_retries, retry_on_status, retry_on_timeout, sniff_on_start, sniff_before_requests, sniff_on_node_failure, sniff_timeout, min_delay_between_sniffing, sniffed_node_callback, meta_header, timeout, randomize_hosts, host_info_callback, sniffer_timeout, sniff_on_connection_fail, http_auth, maxsize, _transport)
    133 def __init__(
    134     self,
    135     hosts: t.Optional[_TYPE_HOSTS] = None,
   (...)
    193     _transport: t.Optional[Transport] = None,
    194 ) -> None:
    195     if hosts is None and cloud_id is None and _transport is None:
--> 196         raise ValueError("Either 'hosts' or 'cloud_id' must be specified")
    198     if timeout is not DEFAULT:
    199         if request_timeout is not DEFAULT:

ValueError: Either 'hosts' or 'cloud_id' must be specified

I ran this code block again, it reported another error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 4
      1 import t_ragx
      3 # Initiate the input processor which will retrieve the memory and glossary results for us
----> 4 input_processor = t_ragx.Processors.ElasticInputProcessor()
      6 # Load/ point to the demo resources
      7 input_processor.load_general_glossary("https://l8u0.c18.e2-1.dev/t-ragx-public/glossary")

AttributeError: module 't_ragx' has no attribute 'Processors'

I am not familiar with Elasticsearch, so I don't know whether the way I initialized the T-Ragx Elasticsearch services is right.
Another question: what should I do if I want to start my own Elasticsearch service? Do you have any guidance?

Unable to access snapshot repository

Hi, I really liked playing around with T-Ragx and am now trying to self-host. I am trying to set up a local instance of the translation memory according to the guide. However, I can't seem to access o3t0.or.idrivee2-37.com. Full error message:

Error: {"error":{"root_cause":[{"type":"repository_exception","reason":"[public_t_ragx_translation_memory] Could not determine repository generation from root blobs"}],"type":"repository_exception","reason":"[public_t_ragx_translation_memory] Could not determine repository generation from root blobs","caused_by":{"type":"i_o_exception","reason":"Exception when listing blobs by prefix [index-]","caused_by":{"type":"sdk_client_exception","reason":"Unable to execute HTTP request: t-ragx-public.o3t0.or.idrivee2-37.com","caused_by":{"type":"unknown_host_exception","reason":"t-ragx-public.o3t0.or.idrivee2-37.com"}}}},"status":500} 

Is the service still available? If not, could you provide documentation on what indexes are required (and sample data, maybe in CSV format). Or could the error be on my side?
