ikergarcia1996 / easy-translate

Easy-Translate is a script for translating large text files with a SINGLE COMMAND. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users.

License: Apache License 2.0

Python 100.00%
4-bit 8-bit begginers cpu easy easy-to-use gpu hugginface hugginface-hub huggingface-transformers llm m2m100 machine-translation nllb200 prompt pytorch quantization transformers translation

easy-translate's People

Contributors

alquist4121, ikergarcia1996, kalebu, ruanchaves

easy-translate's Issues

Usable with variables, not files?

Hello,
I have a large number of headlines and articles, most of them not in English, stored in a database, and I can load them into Python lists.
I would like to loop through the lists and translate each item without first writing the contents to a file.
This is part of a website that initially displays many headlines for users to choose from.
Writing each headline to a file, translating it, and reading the file back into a list would take too much time.
Thank you in advance for your help.
Baruch
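
One way to avoid the file round trip is to call the underlying Hugging Face model directly on a Python list. The sketch below is not part of Easy-Translate: `batched`, `translate_list`, and `make_m2m100_translator` are hypothetical helper names, and `facebook/m2m100_418M` is assumed as the checkpoint purely for illustration.

```python
from typing import Callable, Iterable, List


def batched(items: List[str], batch_size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_list(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate an in-memory list batch by batch, preserving order."""
    translations: List[str] = []
    for batch in batched(sentences, batch_size):
        translations.extend(translate_batch(batch))
    return translations


def make_m2m100_translator(source_lang: str, target_lang: str,
                           model_name: str = "facebook/m2m100_418M"):
    """Build a translate_batch callable backed by an M2M100 checkpoint.

    transformers is imported lazily so the helpers above work without it.
    """
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)
    tokenizer.src_lang = source_lang

    def translate_batch(batch: List[str]) -> List[str]:
        encoded = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id(target_lang)
        )
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate_batch
```

With this in place, something like `translate_list(headlines, make_m2m100_translator("es", "en"))` would return the translated headlines in the same order. The model is loaded once and reused for every batch, which is where most of the time is saved compared with launching the script per headline.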

OSError: It looks like the config file at 'models/pytorch_model.bin' is not a valid JSON file

Hello,
I tested with Debian 11 and 12, CUDA 11.7 and 11.8, different models, different precisions, with and without accel, etc. Other projects based on torch and transformers work well on the same machine.

I have these errors when running the script:
`python3 translate.py --sentences_path sample_text/en.txt --output_path sample_text/en2es.translation.m2m100_1.2B.txt --source_lang en --target_lang es --model_name models/pytorch_model.bin
Loading model from models/pytorch_model.bin
Traceback (most recent call last):
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/configuration_utils.py", line 702, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/configuration_utils.py", line 793, in _dict_from_json_file
text = reader.read()
^^^^^^^^^^^^^
File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "Easy-Translate/translate.py", line 443, in <module>
main(
File "Easy-Translate/translate.py", line 115, in main
model, tokenizer = load_model_for_inference(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/model.py", line 75, in load_model_for_inference
config = AutoConfig.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 983, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/configuration_utils.py", line 617, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Easy-Translate/.env/lib/python3.11/site-packages/transformers/configuration_utils.py", line 705, in _get_config_dict
raise EnvironmentError(

OSError: It looks like the config file at 'models/pytorch_model.bin' is not a valid JSON file.`
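
The traceback suggests that `--model_name` was pointed at the weights file itself, so `AutoConfig.from_pretrained` tried to parse `pytorch_model.bin` as JSON. `from_pretrained` expects either a Hub model id or a directory that contains `config.json`. A minimal sketch of a pre-flight check, assuming the weights sit next to their `config.json` (the helper name `resolve_model_dir` is hypothetical and not part of Easy-Translate):

```python
from pathlib import Path


def resolve_model_dir(model_name: str) -> str:
    """Return a value suitable for from_pretrained().

    If model_name points at a file (for example pytorch_model.bin), fall back
    to its parent directory, provided that directory holds a config.json.
    Anything else (a directory or a Hub id) is returned unchanged.
    """
    path = Path(model_name)
    if path.is_file():
        candidate = path.parent
        if (candidate / "config.json").is_file():
            return str(candidate)
        raise FileNotFoundError(
            f"{model_name} is a file and {candidate} has no config.json; "
            "pass the directory that contains the full checkpoint instead."
        )
    return model_name
```

Under that assumption, passing `--model_name models` (the directory) rather than `--model_name models/pytorch_model.bin` would be the equivalent change on the command line.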

Small 100 not working anymore.

I was previously using this command from the samples, and it no longer works after the updates.
python3 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.small100.txt \
--source_lang en \
--target_lang es \
--model_name alirezamsh/small100

This is the log:
2023-12-06T12:38:42.665144096Z Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2023-12-06T12:38:43.268330478Z The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
2023-12-06T12:38:43.268375418Z The tokenizer class you load from this checkpoint is 'M2M100Tokenizer'.
2023-12-06T12:38:43.268383058Z The class this function is called from is 'SMALL100Tokenizer'.
2023-12-06T12:38:43.269837017Z Loading model from alirezamsh/small100
2023-12-06T12:38:43.269851767Z Loading custom small100 tokenizer for utils.tokenization_small100
2023-12-06T12:38:43.270022978Z Traceback (most recent call last):
2023-12-06T12:38:43.270132589Z File "//Easy-Translate/translate.py", line 538, in <module>
2023-12-06T12:38:43.270510501Z main(
2023-12-06T12:38:43.270625462Z File "//Easy-Translate/translate.py", line 134, in main
2023-12-06T12:38:43.270820843Z model, tokenizer = load_model_for_inference(
2023-12-06T12:38:43.270937283Z File "/Easy-Translate/model.py", line 90, in load_model_for_inference
2023-12-06T12:38:43.271180565Z tokenizer: PreTrainedTokenizerBase = AutoTokenizer.from_pretrained(
2023-12-06T12:38:43.271340276Z File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2045, in from_pretrained
2023-12-06T12:38:43.271888289Z return cls._from_pretrained(
2023-12-06T12:38:43.271961479Z File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
2023-12-06T12:38:43.272524103Z tokenizer = cls(*init_inputs, **init_kwargs)
2023-12-06T12:38:43.272535813Z File "/Easy-Translate/utils/tokenization_small100.py", line 153, in __init__
2023-12-06T12:38:43.272685754Z super().__init__(
2023-12-06T12:38:43.272714444Z File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py", line 366, in __init__
2023-12-06T12:38:43.272931705Z self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
2023-12-06T12:38:43.272935405Z File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py", line 462, in _add_tokens
2023-12-06T12:38:43.273221467Z current_vocab = self.get_vocab().copy()
2023-12-06T12:38:43.273318857Z File "/Easy-Translate/utils/tokenization_small100.py", line 289, in get_vocab
2023-12-06T12:38:43.273603199Z vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
2023-12-06T12:38:43.273645449Z File "/Easy-Translate/utils/tokenization_small100.py", line 192, in vocab_size
2023-12-06T12:38:43.273817970Z return len(self.encoder) + len(self.lang_token_to_id) + self.num_madeup_words
2023-12-06T12:38:43.274002501Z AttributeError: 'SMALL100Tokenizer' object has no attribute 'encoder'. Did you mean: 'encode'?

The model's own Space shows the same error:
https://huggingface.co/spaces/alirezamsh/small100
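
Reading the traceback, the base class initializer calls `get_vocab()` before the tokenizer subclass has assigned `self.encoder`, which is an initialization-order problem. The toy classes below reproduce that pattern without transformers; `BaseTokenizer`, `BrokenTokenizer`, and `FixedTokenizer` are illustrative stand-ins, not the real `SMALL100Tokenizer` code, and the actual fix in the repository may differ.

```python
class BaseTokenizer:
    """Toy stand-in for a base class whose __init__ reads the vocabulary."""

    def __init__(self):
        # Mirrors the traceback: the base initializer calls get_vocab().
        self.initial_vocab = self.get_vocab().copy()

    def get_vocab(self):
        raise NotImplementedError


class BrokenTokenizer(BaseTokenizer):
    def __init__(self, encoder):
        super().__init__()      # get_vocab() runs before self.encoder exists
        self.encoder = encoder

    def get_vocab(self):
        return dict(self.encoder)


class FixedTokenizer(BaseTokenizer):
    def __init__(self, encoder):
        self.encoder = encoder  # assign state first, then initialize the base
        super().__init__()

    def get_vocab(self):
        return dict(self.encoder)
```

Constructing `BrokenTokenizer({"a": 0})` raises an `AttributeError` about the missing `encoder` attribute, while `FixedTokenizer({"a": 0})` succeeds. If the cause is the same here, either moving the attribute assignments above the `super().__init__(` call in `tokenization_small100.py` or pinning an older transformers release would be the two obvious things to try.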

How to translate .srt subtitle files?

I use this command:

python3 translate.py \
--sentences_path input.srt \
--output_path result.srt \
--source_lang eng_Latn \
--target_lang ind_Latn \
--model_name facebook/nllb-200-distilled-600M \
--precision fp16

with this input.srt:

1
00:00:07,312 --> 00:00:09,993
Hello.

2
00:00:09,994 --> 00:00:11,227
Where are you right now?

3
00:00:11,228 --> 00:00:13,360
Right now I am on my way
to South Dakota.

4
00:00:13,361 --> 00:00:16,093
Gonna do a little camping,
do a little fishing.

5
00:00:16,094 --> 00:00:17,426
Good for you, Colter.

but result.srt has problems:

  • wrong order
  • empty lines are replaced with "(dalam bahasa Inggris)"
  • unknown text is appended

1
00:00:07,312 --> 00:00:09,993
Hei, apa yang kau lakukan?
(dalam bahasa Inggris) <-- this should be an empty line
2 (satu) <-- the '(satu)' should not be there
00:00:09,994 --> 00:00:11,227
Di mana kau sekarang?
(dalam bahasa Inggris) ....
3 Pemberantasan Korupsi <-- this should not be there either
00:00:11,228 --> 00:00:13,360
Saat ini aku sedang dalam perjalanan
ke Dakota Selatan.
(dalam bahasa Inggris) ...
4
00:00:13,361 --> 00:00:16,093
Akan pergi berkemah sedikit,
lakukan sedikit memancing.
(dalam bahasa Inggris) ...
5
00:00:16,094 --> 00:00:17,426
Bagus untukmu, Colter.
(dalam bahasa Inggris) ...
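
The output above is consistent with every line of the file, including cue numbers and blank separators, being passed to the model as if it were a sentence. One workaround is to parse the SRT structure first and translate only the dialogue. A minimal sketch, assuming well-formed cues (number, timestamp line, then text) and a caller-supplied `translate_batch` function that maps a list of strings to a list of translations; `translate_srt` is a hypothetical helper, not an Easy-Translate feature:

```python
import re
from typing import Callable, List, Tuple


def translate_srt(srt_text: str,
                  translate_batch: Callable[[List[str]], List[str]]) -> str:
    """Translate only the dialogue of an SRT file.

    Cue numbers and timestamp lines are copied verbatim.
    """
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", normalized)

    cues: List[Tuple[List[str], str]] = []
    for block in blocks:
        lines = block.split("\n")
        header = lines[:2]  # cue number and timestamp line
        # Join wrapped dialogue so the model sees one full sentence per cue.
        text = " ".join(line.strip() for line in lines[2:])
        cues.append((header, text))

    # Only non-empty dialogue is sent to the translator.
    texts = [text for _, text in cues if text]
    translated = iter(translate_batch(texts)) if texts else iter(())

    out_blocks = []
    for header, text in cues:
        body = [next(translated)] if text else []
        out_blocks.append("\n".join(header + body))
    return "\n\n".join(out_blocks) + "\n"
```

Joining wrapped lines is a deliberate trade-off: the model gets complete sentences such as "Right now I am on my way to South Dakota.", but the original line breaks inside a cue are not preserved.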

API plans?

Are there any plans to add a simple API, even just a server run from Python?

BaGRoS
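
Until something official exists, a small wrapper can be built with the Python standard library alone. The sketch below is an assumption about what such an API could look like, not a planned Easy-Translate feature: the JSON shape (`{"sentences": [...]}` in, `{"translations": [...]}` out), the names `handle_payload`, `make_handler`, and `serve`, and the `translate_batch` callable (any function mapping a list of strings to a list of translations) are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, List, Tuple

TranslateFn = Callable[[List[str]], List[str]]


def handle_payload(raw: bytes, translate_batch: TranslateFn) -> Tuple[int, bytes]:
    """Validate a JSON request body and return (HTTP status, JSON response)."""
    try:
        payload = json.loads(raw.decode("utf-8"))
        sentences = payload["sentences"]
        if not isinstance(sentences, list) or not all(
            isinstance(s, str) for s in sentences
        ):
            raise ValueError("'sentences' must be a list of strings")
    except (UnicodeDecodeError, KeyError, TypeError, ValueError) as exc:
        return 400, json.dumps({"error": str(exc)}).encode("utf-8")
    result = {"translations": translate_batch(sentences)}
    return 200, json.dumps(result, ensure_ascii=False).encode("utf-8")


def make_handler(translate_batch: TranslateFn):
    """Create a request handler class bound to one translation function."""

    class TranslateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_payload(self.rfile.read(length), translate_batch)
            self.send_response(status)
            self.send_header("Content-Type", "application/json; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return TranslateHandler


def serve(translate_batch: TranslateFn, host: str = "127.0.0.1", port: int = 8000):
    """Block forever, answering POST requests with translations."""
    HTTPServer((host, port), make_handler(translate_batch)).serve_forever()
```

Keeping the request logic in `handle_payload` separates it from the socket handling, so it can be exercised without starting a server. The model would be loaded once when `translate_batch` is built, then reused for every request.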
