Comments (4)
At the moment we focus on training models for the Tatoeba MT Challenge that we released recently (https://github.com/Helsinki-NLP/Tatoeba-Challenge). There will be some updated models there. Check it out. Otherwise, we will continue updating existing language pairs but progress may be slow as training requires a lot of resources and time. I cannot promise new models frequently.
from opus-mt-train.
And, yes, the trick to improve models is to train more. SentencePiece-based segmentation is also useful, as are some other smaller improvements in data pre-processing.
Oh, great! Many thanks again for the Tatoeba-Challenge project! You recently published Spanish-to-English and other models that we need!
By the way, about the pre-processing step for OPUS datasets: maybe you have read Facebook's paper, https://arxiv.org/pdf/1907.06616.pdf (Facebook FAIR’s WMT19 News Translation Task Submission). It describes two important steps:
- applying language identification filtering, for example with the CLD2 library
- removing sentence pairs with a source/target length ratio exceeding 1.5
And, of course, back-translation. I noticed that you do something with back-translation. There is another Facebook paper with details: https://arxiv.org/abs/1808.09381. This step alone allows them to improve BLEU by 4 points.
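For illustration, the length-ratio filter mentioned above can be sketched in a few lines of Python. This is only a minimal sketch and not the OPUS-MT or FAIR implementation: the function names `length_ratio` and `filter_pairs` are hypothetical, lengths are counted in whitespace-separated tokens, and the ratio is taken symmetrically (longer over shorter), all of which are assumptions.

```python
def length_ratio(src: str, tgt: str) -> float:
    """Return longer/shorter sentence length, counted in whitespace tokens.

    Assumption: a symmetric, token-based ratio. Empty sentences get an
    infinite ratio so that they are always filtered out.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return float("inf")
    return max(n_src, n_tgt) / min(n_src, n_tgt)


def filter_pairs(pairs, max_ratio=1.5):
    """Keep only sentence pairs whose length ratio does not exceed max_ratio."""
    return [(s, t) for s, t in pairs if length_ratio(s, t) <= max_ratio]


pairs = [
    ("Hello world", "Hola mundo"),        # 2 vs 2 tokens: kept
    ("Yes", "Sí , por supuesto que sí"),  # 1 vs 6 tokens: dropped
    ("", "texto"),                        # empty source: dropped
]
kept = filter_pairs(pairs)
```

The threshold is exposed as `max_ratio` because a strict value such as 1.5 may be too aggressive for language pairs whose typical sentence lengths differ, so it is usually worth tuning per pair.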
Yes, I do apply language identification in the new Tatoeba-MT models, along with some other basic filtering. Length-ratio filtering has always been part of the pipeline. It has been well known since the old SMT days and the Moses tools. However, I am not as strict as the paper suggests. There are a lot of hyper-parameters that can be optimized for each language pair. Backtranslation is part of all models that include "+bt" in their string. I need to stress that the OPUS-MT models are not tuned towards news translation from the WMT tests. It is not surprising if there are performance differences, as simple domain adaptation boosts the performance a lot. I will also try to include some fine-tuned models later. A fine-tuning framework is already integrated in OPUS-MT.
By the way, it's a bit funny that most people point to Facebook/Google papers when they refer to techniques developed and proposed by researchers in academia. I guess that universities have to improve their PR units ...