
mammoth's People

Contributors

adamlerer, apaszke, bmccann, bmichele, bpopeters, da03, flauted, francoishernandez, funboarder13920, guillaumekln, gwenniger, helson73, jianyuzhan, jrvc, justinchiu, meocong, onadegibert, pltrdy, scarletpan, sebastiangehrmann, shaoxiongji, soumith, srush, thammegowda, timotheemickus, vince62s, waino, wjbianjason, xutaima, zenglinxiao


Forkers

waino

mammoth's Issues

Interfacing Mammoth and OpusFilter

In the long run, rather than having our custom transforms for data cleaning (as suggested by #13), it would be better to leave it to a relevant third party, such as OpusFilter.

I'm mostly opening this issue for discussion and as a long-term project rather than expecting this to be handled soon.

As far as I can tell, one would need:

  • a contribution to OpusFilter to allow it to read and write through piping (or selecting some other tool that allows cleanup through piping)
  • a transform to handle passing the data from our iterators to the third-party software for cleaning and back into the system:
    • booting a pipe in the warmup method
    • implementing apply through e.g. a popen.read
    • a graceful closure when training ends.
  • (optionally) a deprecation of current filters if they are no longer needed or redundant.
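The steps above could be sketched roughly as follows. This is a minimal illustration, not an existing OpusFilter interface: the line protocol (tab-separated src/tgt pairs in, surviving pairs echoed back, an empty line for a dropped pair) is an assumption made for the sketch.

```python
import subprocess


class ExternalFilterTransform:
    """Hypothetical sketch: pipe examples through an external filter process."""

    def __init__(self, command):
        self.command = command
        self.proc = None

    def warmup(self):
        # Boot the pipe once, before training starts.
        self.proc = subprocess.Popen(
            self.command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def apply(self, example):
        # Send one example and read the filter's verdict back.
        self.proc.stdin.write(example["src"] + "\t" + example["tgt"] + "\n")
        self.proc.stdin.flush()
        line = self.proc.stdout.readline().rstrip("\n")
        if not line:  # empty line: the filter dropped the example
            return None
        src, tgt = line.split("\t")
        return {"src": src, "tgt": tgt}

    def close(self):
        # Graceful shutdown when training ends.
        self.proc.stdin.close()
        self.proc.wait()
```

The multi-node caveat below still applies: each trainer process would spawn its own subprocess, which may not be acceptable on all clusters.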

Note that there are several challenges ahead: in particular,

  • one will have to find a way around sub-process creation in multi-node settings
  • it's unclear how much effort would be required to implement the required behavior on the third party side (OpusFilter or other).

Remove the need for communicating ready_t

A model component (e.g. the Swahili encoder) is likely to exist on multiple devices. Because each device samples its own task sequence, it is possible that when a gradient synchronization point occurs, some of the devices have trained a task involving that component and thus have a gradient for its parameters ("are ready"), while other devices have not trained such a task ("not ready").

Currently synchronization of a component starts with a reduction of the ready_t tensor, where each device sends a 1 if ready and a 0 if not. Summing the ready bits together provides the normalization for the gradient. The gradient itself is also reduced as a sum of one tensor per device, with unready devices sending a zero tensor as a dummy value for the missing gradient.

The additional reduction incurs a cost in terms of performance. To speed up training, the necessary information could be precomputed to avoid communication.

Currently each trainer process has one TaskQueueManager (Note that dataloaders also have their own copy, but do not participate in the gradient synch). The TQM is aware of which components are present on the device it serves, and it is responsible for sampling the order in which tasks are trained. For this change, TQM must be aware of the global assignment of components to devices and the global sequence of tasks, rather than just local to the device itself.

  • Task sampling must be deterministic. The easiest way to achieve this is to have a separate RNG for each task sequence, and seed each copy of it with a predetermined seed.
  • Each device must have a copy of the TQM for all devices, not just itself. Alternatively, TQM must be modified to track the global state, so that a single TQM instance can provide the necessary information.
  • Instead of communicating the ready bits, the normalization for each component should be retrieved from the TQM. Alternatively, sync of a component could be postponed until all devices are ready (dynamic accumulation count).
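The deterministic-replay idea in the first two bullets can be sketched as follows; the task names and the per-device seed mapping are illustrative, not mammoth's actual configuration.

```python
import random


def make_task_sequences(task_seeds, length):
    """Sketch: every process regenerates the same task schedules from fixed seeds.

    task_seeds maps a device name to a predetermined seed (assumed to come
    from the config). Because each sequence has its own RNG, any process can
    replay any device's schedule without communication.
    """
    sequences = {}
    for device, seed in task_seeds.items():
        rng = random.Random(seed)  # separate RNG per task sequence
        sequences[device] = [
            rng.choice(["en-sw", "sw-en", "en-fi"]) for _ in range(length)
        ]
    return sequences


def ready_count(sequences, step, component_tasks):
    """Number of devices that have a gradient for a component at a sync point,
    i.e. have trained at least one task involving that component."""
    return sum(
        any(task in component_tasks for task in seq[: step + 1])
        for seq in sequences.values()
    )
```

With this, the normalization that ready_t currently communicates becomes a local lookup.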

update buckets in `LookAheadBucketting`

Currently, items are sorted into buckets based on the sum of the source and target sequence lengths. This can lead to incorrect padding decisions.

Items should be sorted using a tuple of lengths, rather than a single sum of lengths. In principle, this would entail maintaining a 2d array of buckets and devising a systematic iteration through this array.
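A minimal sketch of tuple-based bucket assignment; the boundary values are illustrative, not mammoth's configuration.

```python
import bisect


def bucket_index(length, boundaries):
    """Index of the smallest boundary that is >= length."""
    return bisect.bisect_left(boundaries, length)


def assign_bucket(src_len, tgt_len, boundaries=(8, 16, 32, 64)):
    """Sketch: sort items into a 2D grid keyed by (src, tgt) lengths,
    instead of a 1D grid keyed by src + tgt."""
    return (bucket_index(src_len, boundaries), bucket_index(tgt_len, boundaries))
```

A (5, 30) pair and a (30, 5) pair have the same length sum, so the current scheme buckets them together even though they pad very differently; the tuple keeps them apart.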

Redesign of module initialization

  • Redesign parameter initialization so as to ensure a consistent and systematic design
  • Allow non-random initialization:
    • from huggingface
    • from other related mammoth modules

cleanup & refactoring

We have a number of things to clean up:

  • remove or replace mentions of openNMT-py / onmt
  • remove config options that are no longer supported
  • add repository badges for build status, license, etc.

Some normalization in order:

  • opts and opt should be made consistent and distinct from optim
  • onmt.utils.distributed.py should be its own submodule onmt.distributed
  • it is doubtful we want to keep encoder / decoder as separate modules, rather than grouping them into onmt.modules (would clean up code for adapters, and would be a reasonable place to move attention_bridge.py).
  • enc_rnn_size, model_dim, dec_rnn_size ... should be normalized to a single, consistent variable
  • pool_size, bucket_size

Broken gradient in mixed validation settings

When not all of the tasks have validation paths, I get the following stack trace after the first validation

-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "/users/mickusti/onmtttest/onmt/utils/distributed.py", line 382, in consumer
    process_fn(
  File "/users/mickusti/onmtttest/onmt/train_single.py", line 214, in main
    trainer.train(
  File "/users/mickusti/onmtttest/onmt/trainer.py", line 292, in train
    onmt.utils.distributed.only_ready_reduce_and_rescale_grads(params, group=group)
  File "/users/mickusti/onmtttest/onmt/utils/distributed.py", line 207, in only_ready_reduce_and_rescale_grads
    grads = [p.grad.data for p in params_with_grad]
  File "/users/mickusti/onmtttest/onmt/utils/distributed.py", line 207, in <listcomp>
    grads = [p.grad.data for p in params_with_grad]
AttributeError: 'NoneType' object has no attribute 'data'
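The failing comprehension assumes every parameter has a gradient, but a module whose tasks were skipped between validations has p.grad set to None. One defensive sketch (not necessarily the fix to adopt upstream) substitutes zero dummies, mirroring what unready devices already send during the reduction:

```python
import torch


def safe_grads(params):
    """Sketch: replace missing gradients with zero dummies before the
    all-reduce, instead of assuming every parameter has a .grad."""
    return [
        p.grad if p.grad is not None else torch.zeros_like(p)
        for p in params
    ]
```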

Plugging things back in: Translator features / sharding

Setting a shard_size different from 0 currently causes the model to duplicate the test set.

The simplest way to circumvent this at the moment is to set the flag -shard_size 0.

Currently, the src_feats_shard has been commented out: i.e., manual features can't be provided to the model for now.

Does MAMMOTH framework plan to focus on evaluations?

Thanks for the work!

Is there any plan to focus on evaluations beyond traditional train/val/test splitting of data and running certain metrics? For example, are there any plans to have model-based translation evaluation, i.e., a separate model trained solely to evaluate translation, possibly reference-free (similar to COMET)? Or another language model, instruction-tuned to choose which translation is better (given several options) and explain the reasoning?

Write end of training to structured log file

HPO needs a message that training has been completed, so that it can end the trial and begin another one.
Currently only validation results are written to the structured log file.
We should add a structured message indicating end of training.
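A minimal sketch, assuming a JSON-lines log format; the field names are hypothetical and mammoth's actual structured-log schema may differ.

```python
import json
import time


def log_structured(path, record):
    """Sketch: append one JSON object per line to the structured log."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


def log_training_end(path, step):
    # Hypothetical record type; HPO can watch for it to end the trial.
    log_structured(path, {"type": "training_end", "step": step, "time": time.time()})
```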

task-based LR scheduling, finer-grained val reporting

  • break down accuracy and validation loss per task when reporting stats
  • introduce validation-based LR scheduling (reduce LR on plateau) that's properly adjusted depending on the performances of individual modules:
    • for each module, initialize module-specific LRs to the default value
    • at validation:
      • compute perf per task
      • aggregate (mean or task-weighted mean) perf per module across task
      • reduce module-specific LR based on the aggregate perf
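The scheme above could be sketched as follows; the class name, the mean aggregation, and the patience/factor defaults are illustrative, not an existing mammoth API.

```python
class ModuleLRScheduler:
    """Sketch: one LR per module, reduced when the module's aggregate
    validation loss stops improving (reduce-on-plateau)."""

    def __init__(self, modules, base_lr, factor=0.5, patience=2):
        self.lrs = {m: base_lr for m in modules}
        self.best = {m: float("inf") for m in modules}
        self.bad_steps = {m: 0 for m in modules}
        self.factor, self.patience = factor, patience

    def step(self, task_losses, module_tasks):
        # task_losses: {task: val_loss}; module_tasks: {module: [tasks]}
        for module, tasks in module_tasks.items():
            # aggregate per-module performance as the mean over its tasks
            loss = sum(task_losses[t] for t in tasks) / len(tasks)
            if loss < self.best[module]:
                self.best[module], self.bad_steps[module] = loss, 0
            else:
                self.bad_steps[module] += 1
                if self.bad_steps[module] > self.patience:
                    self.lrs[module] *= self.factor
                    self.bad_steps[module] = 0
```

A task-weighted mean could replace the plain mean where tasks differ greatly in data size.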

partial training of system

Freezing some of the modules would allow training adapters as actual adapters.

Ideally, this would entail introducing some mechanism to mark specific layerstacks/adapters in the config as not requiring gradient.

To be confirmed, but we can probably just do a combination of the following to get the desired behavior:

  • have marked modules leave all communication groups
  • not apply the forward has_grad hook to these modules
  • remove them from gradient computations with module.requires_grad_(False)
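The requires_grad_ step could look roughly like this; frozen_names is a hypothetical config field, and the communication-group and hook changes are mammoth-specific and not shown.

```python
import torch.nn as nn


def freeze_marked(model: nn.Module, frozen_names):
    """Sketch: freeze submodules whose names appear in the (hypothetical)
    config field, and return the names of the frozen parameters."""
    for name, module in model.named_modules():
        if name in frozen_names:
            module.requires_grad_(False)
    return [n for n, p in model.named_parameters() if not p.requires_grad]
```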

workflow failure on doc building

Current actions for doc building are failing with the following error message:

Exception occurred:
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/flask/helpers.py", line 16, in <module>
    from werkzeug.urls import url_quote
ImportError: cannot import name 'url_quote' from 'werkzeug.urls' (/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/werkzeug/urls.py)
The full traceback has been saved in /tmp/sphinx-err-dybf5qvp.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [Makefile:20: html] Error 2
Error: Process completed with exit code 2.

Failures seem to be systematic since at least today; the last successful doc build was action 156, three days ago. Everything run today has failed doc building.

I'm not entirely sure what's causing it; moreover, I'm having a hard time reproducing it locally. My guess would be something related to dependencies.

Implement MASS objective for denoising autoencoder

According to Baziotis et al. [1], the MASS objective [2] is superior to BART for training on monolingual data using the denoising sequence auto-encoder approach.

BART (already implemented in mammoth) is easier to implement though, as it fits within the transform framework. MASS, on the other hand, requires:

  • Two separate target language sequences: decoder input and output. (x_5 is replaced by a mask symbol in the input)
  • Token-level masking of the loss. Only the three unmasked tokens x_3--x_5 contribute to the loss. The other tokens can be generated (as it is easier than not doing so), but their contribution to the loss should be zeroed out.

Both MASS and BART use sampled noise, and must be applied online during training, as opposed to data-cleaning preprocessing (#13), which can also be applied offline.

Note that MASS is tricky to implement, whereas we already have BART. It may not be worth the effort to implement this.
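The token-level loss masking MASS needs can be sketched as follows; the tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F


def mass_loss(logits, target, loss_mask):
    """Sketch of MASS-style token-level loss masking: all positions are
    generated, but only positions where loss_mask is 1 contribute to the
    loss; the others are zeroed out.

    logits: (seq, vocab); target: (seq,); loss_mask: (seq,) in {0, 1}.
    """
    per_token = F.cross_entropy(logits, target, reduction="none")
    return (per_token * loss_mask).sum() / loss_mask.sum()
```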

[figure: illustration of the MASS masking scheme]

[1] Baziotis et al. (2023) "When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale" https://arxiv.org/abs/2305.14124
[2] Song et al. (2019) "MASS: Masked Sequence to Sequence Pre-training for Language Generation" https://arxiv.org/abs/1905.02450

Developing data filtering transforms

Current preliminary experiments suggest we have datapoints with repeated substrings; other quality issues are likely to be found down the line. It would be useful to have a means of removing unusable or garbage data (similar to what is done in OpusFilter).

This is something that can be done with transforms (same mechanism as in OpenNMT-py, since that's where it comes from).

Transform objects inherit from onmt.transforms.Transform, and have:

  1. an apply method, which takes as input an example dict containing a src and a tgt key, and returns:
    • either None (to signal the datapoint was dropped)
    • or a dict (containing potentially modified copies of src and tgt, for further preprocessing)
  2. add_option and _parse_opts class methods, to declare config arguments and retrieve values as necessary
  3. a @register_transform decorator to ensure they are automatically discovered by the rest of the code.

See also the declaration of FilterTooLongTransform for a minimal example of how the filtertoolong transform is declared in the code.

Ideally, for this enhancement, one would need to:

  • create a file onmt/transforms/filtering.py
  • move everything related to the FilterTooLongTransform in onmt/transforms/misc.py to this new file
  • either modify this transform or create new classes to remove datapoints based on:
    • source length / target length ratios
    • repeated substrings in source or in target
    • Levenshtein similarity between source and target
    • whatever else makes sense
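A minimal sketch of one such filter (repeated substrings), written as a plain class rather than inheriting from onmt.transforms.Transform so it stays self-contained; the regex and the threshold are illustrative.

```python
import re


class RepeatedSubstringFilter:
    """Sketch: drop examples where a token repeats consecutively more than
    max_repeats times on either side."""

    def __init__(self, max_repeats=3):
        # a token, followed by itself at least max_repeats more times
        self.pattern = re.compile(r"\b(\S+)(?:\s+\1){%d,}" % max_repeats)

    def apply(self, example):
        if self.pattern.search(example["src"]) or self.pattern.search(example["tgt"]):
            return None  # signal that the datapoint was dropped
        return example
```

The ratio and Levenshtein filters would follow the same apply contract, returning None to drop.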

The filtertoolong transform is insufficient to remove overlong examples

Even when using the filtertoolong transform at both the beginning and the end of the transform pipeline, issues remain indicating that excessively long examples are not successfully removed.

Training may crash when embedding the input:

onmt.modules.embeddings.SequenceTooLongError: Sequence is 5124 but PositionalEncoding is limited to 5000. See max_len argument.

Also, the sentencepiece transform logs the following warnings, which also indicate very long and/or repetitive strings:

unigram_model.cc(494) LOG(WARNING) Too big agenda size 10004. Shrinking (round 134) down to 50.

One possible approach is to implement two different filtering transforms:

  • In the beginning of the pipeline, remove empty strings and strings consisting of too many characters.
  • After subword segmentation, remove strings consisting of too many tokens.
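The two-stage approach could be sketched as follows; the thresholds are illustrative, not proposed defaults.

```python
def filter_too_long_chars(example, max_chars=2000):
    """Sketch of the first-stage filter, applied before subword segmentation:
    drop empty strings and strings with too many characters."""
    for side in ("src", "tgt"):
        if not example[side] or len(example[side]) > max_chars:
            return None
    return example


def filter_too_long_tokens(example, max_tokens=200):
    """Sketch of the second-stage filter, applied after subword segmentation:
    drop examples with too many tokens on either side."""
    for side in ("src", "tgt"):
        if len(example[side].split()) > max_tokens:
            return None
    return example
```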
