helsinki-nlp / mammoth
MAMMOTH: MAssively Multilingual Modular Open Translation @ Helsinki
Home Page: https://helsinki-nlp.github.io/mammoth/
License: MIT License
Plugging back the early stopper should be as simple as uncommenting its call in the validate(...)
method of the trainer.
In the long run, rather than having our custom transforms for data cleaning (as suggested by #13), it would be better to leave it to a relevant third party, such as OpusFilter.
I'm mostly opening this issue for discussion and as a long-term project rather than expecting this to be handled soon.
As far as I can tell, one would need:
- a `warmup` method
- an `apply` method, e.g. through a `popen.read`

Note that there are several challenges ahead: in particular,
A model component (e.g. the Swahili encoder) is likely to exist on multiple devices. Because each device samples its own task sequence, it is possible that when a gradient synchronization point occurs, some of the devices have trained a task involving that component and thus have a gradient for its parameters ("are ready"), while other devices have not trained such a task ("not ready").
Currently, synchronization of a component starts with a reduction of the `ready_t` tensor, where each device sends a `1` if ready and a `0` if not. Summing the ready bits together provides the normalization for the gradient. The gradient itself is also reduced as a sum of one tensor per device, with unready devices sending a zero tensor as a dummy value for the missing gradient.
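A simplified sketch of this two-step reduction, assuming `torch.distributed` process groups; function and variable names here are illustrative, not the actual MAMMOTH API:

```python
import torch
import torch.distributed as dist

def reduce_component_grads(params, ready: bool, group=None):
    """Illustrative sketch: all-reduce a ready bit, then the gradients.

    Devices that did not train a task involving this component contribute a
    0 ready bit and zero dummy tensors, so the summed gradient can be
    renormalized by the number of ready devices.
    """
    device = params[0].device
    ready_t = torch.tensor([1.0 if ready else 0.0], device=device)
    dist.all_reduce(ready_t, op=dist.ReduceOp.SUM, group=group)  # the extra communication round
    n_ready = ready_t.item()

    for p in params:
        grad = p.grad if (ready and p.grad is not None) else torch.zeros_like(p)
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
        if n_ready > 0:
            p.grad = grad / n_ready  # normalize by the number of contributing devices
```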
The additional reduction incurs a cost in terms of performance. To speed up training, the necessary information could be precomputed to avoid communication.
Currently each trainer process has one `TaskQueueManager` (note that dataloaders also have their own copy, but do not participate in the gradient sync). The TQM is aware of which components are present on the device it serves, and it is responsible for sampling the order in which tasks are trained. For this change, the TQM must be aware of the global assignment of components to devices and of the global sequence of tasks, rather than only of what is local to its own device.
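A minimal sketch of what such precomputation could look like, assuming each TQM can see the deterministic task schedule of every device and a mapping from tasks to the components they train; all names below are hypothetical, not the existing MAMMOTH API:

```python
from collections import defaultdict

def precompute_ready_counts(global_schedules, task_components):
    """For every training step and component, count how many devices will
    have a gradient for that component, replacing the ready_t reduction.

    global_schedules: {device_id: [task_id per step]}, i.e. the same
        deterministic task sequences the TQMs would sample, for all devices.
    task_components: {task_id: set of component names trained by that task}.
    Returns {step: {component: number of ready devices}}.
    """
    ready_counts = defaultdict(lambda: defaultdict(int))
    for device_id, schedule in global_schedules.items():
        for step, task_id in enumerate(schedule):
            for component in task_components[task_id]:
                ready_counts[step][component] += 1
    return ready_counts

# Example with two devices and shared encoders/decoders:
schedules = {0: ["en-sw", "en-fr"], 1: ["sw-en", "fr-en"]}
components = {
    "en-sw": {"en_encoder", "sw_decoder"},
    "en-fr": {"en_encoder", "fr_decoder"},
    "sw-en": {"sw_encoder", "en_decoder"},
    "fr-en": {"fr_encoder", "en_decoder"},
}
counts = precompute_ready_counts(schedules, components)
# counts[0]["en_encoder"] == 1; no communication is needed to know this at step 0.
```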
Currently, items are sorted into buckets based on the sum of the source and target sequence lengths. This can lead to incorrect padding decisions.
Items should be sorted using a tuple of lengths, rather than a single sum of lengths. In principle, this would entail maintaining a 2d array of buckets and devising a systematic iteration through this array.
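A minimal sketch of bucketing on a (source length, target length) tuple rather than on the sum; the bucket width and data structures are illustrative assumptions, not the current implementation:

```python
from collections import defaultdict

BUCKET_WIDTH = 8  # assumed granularity, in tokens

def bucket_key(src_len: int, tgt_len: int) -> tuple:
    """Quantize the two lengths separately instead of bucketing on their sum."""
    return (src_len // BUCKET_WIDTH, tgt_len // BUCKET_WIDTH)

def bucket_examples(examples):
    """examples: iterable of (src_tokens, tgt_tokens) pairs.

    Returns a dict mapping 2D bucket keys to lists of examples, so that the
    examples in one bucket have similar source AND target lengths, and the
    padding decision is made per dimension rather than on the combined length.
    """
    buckets = defaultdict(list)
    for src, tgt in examples:
        buckets[bucket_key(len(src), len(tgt))].append((src, tgt))
    return buckets

# A systematic iteration through the 2D array of buckets could then simply be
# sorted(buckets), which walks the keys in (src_bucket, tgt_bucket) order.
```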
We have a number of things to clean up:
Some normalization in order:
- `opts` and `opt` should be made consistent and distinct from `optim`
- `onmt.utils.distributed.py` should be its own submodule `onmt.distributed`
- `encoder` / `decoder` as separate modules, rather than grouping them into `onmt.modules` (would clean up code for adapters, and would be a reasonable place to move `attention_bridge.py`)
- `enc_rnn_size`, `model_dim`, `dec_rnn_size`, ... should be normalized to a single, consistent variable
- `pool_size`, `bucket_size`
When not all of the tasks have validation paths, I get the following stack trace after the first validation:

```
-- Tracebacks above this line can probably be ignored --
Traceback (most recent call last):
  File "/users/mickusti/onmtttest/onmt/utils/distributed.py", line 382, in consumer
    process_fn(
  File "/users/mickusti/onmtttest/onmt/train_single.py", line 214, in main
    trainer.train(
  File "/users/mickusti/onmtttest/onmt/trainer.py", line 292, in train
    onmt.utils.distributed.only_ready_reduce_and_rescale_grads(params, group=group)
  File "/users/mickusti/onmtttest/onmt/utils/distributed.py", line 207, in only_ready_reduce_and_rescale_grads
    grads = [p.grad.data for p in params_with_grad]
  File "/users/mickusti/onmtttest/onmt/utils/distributed.py", line 207, in <listcomp>
    grads = [p.grad.data for p in params_with_grad]
AttributeError: 'NoneType' object has no attribute 'data'
```
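The failing list comprehension assumes every parameter in `params_with_grad` already holds a gradient, which is not true when a component has never been trained on that device before the first validation. A minimal sketch of a defensive fix; the zero-fill fallback is an assumption about the intended behavior, not the actual patch:

```python
import torch

def collect_grads(params_with_grad):
    """Drop-in sketch for the failing comprehension: parameters whose .grad is
    still None (e.g. a component never trained on this device) contribute a
    zero tensor instead of raising AttributeError."""
    return [
        p.grad.data if p.grad is not None else torch.zeros_like(p.data)
        for p in params_with_grad
    ]
```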
Setting a `shard_size` different from 0 currently causes the model to duplicate the test set.
The simplest way to circumvent this at the moment is to set the flag `-shard_size 0`.
For now, `src_feats_shard` has been commented out, i.e., manual features can't be provided to the model.
- `tasks_per_communication_batch` to `accum_count`
- `n_samples` to 1 in the strategy (& repeat the same value `accum_count` times)

Thanks for the work!
Is there any plan to focus on evaluations beyond traditional train/val/test splitting of data and running certain metrics? For example, are there any plans to have model-based translation evaluation, i.e., a separate model trained solely to evaluate translation, possibly reference-free (similar to COMET)? Or another language model, instruction-tuned to choose which translation is better (given several options) and explain the reasoning?
HPO needs a message that training has been completed, so that it can end the trial and begin another one.
Currently only validation results are written to the structured log file.
We should add a structured message indicating end of training.
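A minimal sketch of what such a message could look like, assuming the structured log is a JSON-lines file; the field names and function are illustrative, not the existing logging code:

```python
import json
import time

def log_training_finished(structured_log_path: str, step: int):
    """Append a structured 'end of training' record, so an HPO driver tailing
    the log can close the current trial and start the next one."""
    record = {
        "type": "train_end",   # hypothetical message type
        "step": step,          # last completed training step
        "timestamp": time.time(),
    }
    with open(structured_log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```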
Freezing some of the modules would allow training adapters as actual adapters.
Ideally, this would entail introducing some mechanism to mark, in the config, specific layerstacks/adapters as not requiring gradient.
To be confirmed, but we can probably just do a combination of the following to get the desired behavior (see the sketch below):
- a `has_grad` hook on these modules
- `module.requires_grad_(False)`
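A minimal sketch of the freezing part, assuming the frozen modules can be identified by name; the config key and lookup logic are illustrative assumptions:

```python
import torch.nn as nn

def freeze_modules(model: nn.Module, frozen_names):
    """Disable gradients for the named submodules (e.g. pretrained layerstacks)
    so that only the adapters keep training."""
    for name, module in model.named_modules():
        if name in frozen_names:
            module.requires_grad_(False)  # freezes every parameter under this module
            module.eval()  # optionally also fix dropout / normalization statistics

# Usage with a hypothetical config entry:
# freeze_modules(model, opts.frozen_modules)  # e.g. ["encoder.layerstack_0"]
```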
Current actions for doc building are failing with the following error message:
```
Exception occurred:
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/flask/helpers.py", line 16, in <module>
    from werkzeug.urls import url_quote
ImportError: cannot import name 'url_quote' from 'werkzeug.urls' (/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/werkzeug/urls.py)

The full traceback has been saved in /tmp/sphinx-err-dybf5qvp.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [Makefile:20: html] Error 2
Error: Process completed with exit code 2.
```
Failures seem to be systematic since at least today: the last successful doc build was action 156, three days ago, and everything run today has failed doc building.
I'm not entirely sure what's causing it, and I'm having a hard time reproducing it locally. My guess would be something related to dependencies (possibly the Werkzeug 3.0 release, which removed `url_quote` from `werkzeug.urls` and breaks older Flask versions).
According to Baziotis et al. [1], the MASS objective [2] is superior to BART for training on monolingual data using the denoising sequence auto-encoder approach.
BART (already implemented in MAMMOTH) is easier to implement though, as it fits within the transform framework. MASS, on the other hand, requires:
Both MASS and BART use sampled noise and must be applied online during training, as opposed to cleaning preprocessing (#13), which can also be applied offline.
Note that MASS is tricky to implement, while we already have BART. It may not be worth the effort to implement this.
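For illustration, a minimal sketch of MASS-style noise, following Song et al. [2]: mask one contiguous span of roughly half the sentence on the encoder side and predict that span on the decoder side. This is not existing MAMMOTH code; it only shows why src and tgt must be built jointly from one monolingual sentence, which is what makes MASS awkward for the transform framework:

```python
import random

MASK = "<mask>"

def mass_noise(tokens, mask_ratio=0.5):
    """Build a (src, tgt) pair from one monolingual sentence, MASS-style.

    The encoder input has a contiguous span replaced by mask tokens; the
    decoder target is the masked span itself. Assumes a non-empty token list.
    """
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = random.randint(0, len(tokens) - span_len)
    src = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    tgt = tokens[start:start + span_len]
    return src, tgt

# One possible outcome:
# mass_noise("the quick brown fox jumps".split())
# -> (['the', '<mask>', '<mask>', 'fox', 'jumps'], ['quick', 'brown'])
```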
[1] Baziotis et al. (2023), "When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale". https://arxiv.org/abs/2305.14124
[2] Song et al. (2019), "MASS: Masked Sequence to Sequence Pre-training for Language Generation". https://arxiv.org/abs/1905.02450
Current preliminary experiments suggest we have datapoints with repeated substrings; other quality issues are likely to be found down the line. It would be useful to have a means of removing unusable or garbage data (similar to what is done in OpusFilter).
This is something that can be done with transforms (same mechanism as in OpenNMT-py, since that's where it comes from).
Transform objects inherit from `onmt.transforms.Transform`, and have:
- an `apply` method, which takes as input an `example` dict containing a `src` and a `tgt` key and returns either `None` (to signal the datapoint was dropped) or the example dict (with possibly modified `src` and `tgt`, for further preprocessing)
- `add_option` and `_parse_opts` class methods, to declare config arguments and retrieve values as necessary
- a `@register_transform` decorator to ensure they are automatically discovered by the rest of the code.

See also the declaration of `FilterTooLongTransform` for a minimal example of how the `filtertoolong` transform is declared in the code.
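A minimal sketch of what a new filtering transform could look like under this interface; the class name, option names and the repeated-token heuristic are hypothetical, and the exact import paths and signatures should be checked against the existing `FilterTooLongTransform`:

```python
from onmt.transforms import register_transform
from onmt.transforms.transform import Transform  # import path assumed


@register_transform(name="filterrepeats")
class FilterRepeatsTransform(Transform):
    """Hypothetical transform dropping examples with long runs of repeated tokens."""

    @classmethod
    def add_options(cls, parser):
        group = parser.add_argument_group("Transform/FilterRepeats")
        group.add("--max_repeat", "-max_repeat", type=int, default=4,
                  help="Drop examples where a token repeats more than this many times in a row.")

    def _parse_opts(self):
        self.max_repeat = self.opts.max_repeat

    def apply(self, example, is_train=False, stats=None, **kwargs):
        # Return None to drop the datapoint if either side contains a token
        # repeated more than max_repeat times consecutively.
        for side in ("src", "tgt"):
            tokens = example[side]
            run = 1
            for prev, cur in zip(tokens, tokens[1:]):
                run = run + 1 if cur == prev else 1
                if run > self.max_repeat:
                    return None
        return example
```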
Ideally, for this enhancement, one would need to:
- add a new file `onmt/transforms/filtering.py`
- move `FilterTooLongTransform` (currently in `onmt/transforms/misc.py`) to this new file

Even when using the `filtertoolong` transform both at the beginning and at the end of the transform pipeline, there remain issues indicating that excessively long examples are not successfully removed.
Training may crash when embedding the input:
onmt.modules.embeddings.SequenceTooLongError: Sequence is 5124 but PositionalEncoding is limited to 5000. See max_len argument.
The sentencepiece transform also logs the following warnings, which likewise indicate very long and/or repetitive strings:
unigram_model.cc(494) LOG(WARNING) Too big agenda size 10004. Shrinking (round 134) down to 50.
One possible approach is to implement two different filtering transforms: