davidnemeskey / embert

emtsv module for pre-trained Transformer-based models

License: GNU Lesser General Public License v3.0

Python 100.00%

embert's Issues

Use DistributedDataParallel

Under Python 3.9, I could not get the DataParallel-based training to work (version 1.2.0), so multi-GPU training has been removed from version 1.3.0. Reimplement this functionality properly with DistributedDataParallel.

dataclasses requires Python >= 3.7; requirements.txt needs updating to support 3.6

Line 6 of embert/data_classes.py imports dataclasses, which is part of the standard library only from Python 3.7 onwards, so the dataclasses backport needs to be added to requirements.txt for Python 3.6 support.

Please update requirements.txt or correct the supported Python versions in setup.py!
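
One way to keep Python 3.6 installable is a PEP 508 environment marker, so that the backport is pulled in only where the standard library lacks the module. A minimal sketch; the `INSTALL_REQUIRES` name is illustrative and not taken from the project's setup.py:

```python
# Hypothetical fragment for setup.py; the same requirement string
# also works as a single line in requirements.txt.
INSTALL_REQUIRES = [
    # The PyPI "dataclasses" backport is only needed on Python 3.6.
    'dataclasses; python_version < "3.7"',
]
```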

Thank you!

Warnings and errors during prediction

FYI: If I try to run emBERT (ab4bbda and 2393b4a@emtsv) on MNSZ2 the following warnings and errors are generated:

/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
done
/app/embert/embert/viterbi.py:29: RuntimeWarning: divide by zero encountered in log
  self.init = fn(init, dtype=float)
/app/embert/embert/viterbi.py:31: RuntimeWarning: divide by zero encountered in log
  self.trans = fn(trans, dtype=float).T

Do I have to have CUDA for prediction?
These divide by zero warnings do not look good. They should be silenced if they are not meaningful.
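
The warnings presumably come from taking the logarithm of zero probabilities in the initial and transition tables. In log-space Viterbi, -inf is a sensible value for an impossible transition, so the warning itself is likely harmless. A minimal standard-library sketch of a log that maps zero to -inf without warning (`safe_log` is a hypothetical helper, not part of emBERT); with NumPy, wrapping the call in `np.errstate(divide="ignore")` should have the same effect:

```python
import math

def safe_log(p):
    """Return log(p), mapping a zero probability to -inf without a warning."""
    return math.log(p) if p > 0 else -math.inf

# Illustrative initial probabilities; the third state can never start a sequence.
init = [0.5, 0.5, 0.0]
log_init = [safe_log(p) for p in init]
```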

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/xtsv/tsvhandler.py", line 67, in process
    yield from ('{0}\n'.format('\t'.join(tok)) for tok in internal_app.process_sentence(sen, field_values))
  File "/app/embert/embert/embert.py", line 99, in process_sentence
    classes = self.evaluator.predict().y_pred[0]
  File "/app/embert/embert/evaluate.py", line 73, in predict
    return self(True)
  File "/app/embert/embert/evaluate.py", line 97, in __call__
    seq_len = np.where(max_prob == self.sep_id)[0][0]
IndexError: index 0 is out of bounds for axis 0 with size 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/main.py", line 25, in <module>
    output_iterator.writelines(build_pipeline(input_data, used_tools, tools, presets, conll_comments))
  File "/usr/local/lib/python3.8/site-packages/xtsv/tsvhandler.py", line 70, in process
    raise type(e)('In "{0}" at {1}: {2}'.format(track_stream['file_name'], curr_line, str(e))).\
  File "/usr/local/lib/python3.8/site-packages/xtsv/tsvhandler.py", line 67, in process
    yield from ('{0}\n'.format('\t'.join(tok)) for tok in internal_app.process_sentence(sen, field_values))
  File "/app/embert/embert/embert.py", line 99, in process_sentence
    classes = self.evaluator.predict().y_pred[0]
  File "/app/embert/embert/evaluate.py", line 73, in predict
    return self(True)
  File "/app/embert/embert/evaluate.py", line 97, in __call__
    seq_len = np.where(max_prob == self.sep_id)[0][0]
IndexError: In "no filename for stream" at [LINE XXX]: index 0 is out of bounds for axis 0 with size 0

This stops the processing of the current file.
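
The IndexError is raised when no position matches `self.sep_id`, so the `[0][0]` lookup runs on an empty result. A minimal sketch of a guarded lookup in plain Python; `first_sep_index` is a hypothetical helper, and falling back to the full sequence length is an assumption about the desired behaviour, not the author's fix:

```python
def first_sep_index(token_ids, sep_id):
    """Index of the first sep_id, or len(token_ids) if there is none."""
    for i, tok in enumerate(token_ids):
        if tok == sep_id:
            return i
    # No separator found: treat the whole sequence as content.
    return len(token_ids)
```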

I don't mean to rush you; I'm just reporting the above bugs and behaviour.

Please pin transformers to <3.0.0 as the new installs of emBERT are broken

Due to API breakage in the transformers package, you should either pin the package version or update emBERT to support newer transformers versions.

Personally, I recommend the first option.

Thank You!

UTFA

Use The Fine API. transformers has improved a lot lately (it even has ~~Autobots~~ AutoModels); in particular, tokenizers have become more convenient to use. Look into which parts of the code can be replaced with now-standard API calls.

Sentiment analysis

Add a sentiment classifier tool.

intended meaning of arguments in `tokenization_comparison.py` (question and suggestion)

Dávid, I'm creating this issue because I think it's easier to keep track of than an e-mail,
but if you don't like this, feel free to close it and continue some other way.
Thanks for the awesome repo!
In my understanding, tokenization_comparison compares two tokenizers (in the WordPiece sense) based on a corpus, so

--vocab-file is the gold tokenizer,
--model-dir is the "system" tokenizer, and
--input-dir is the corpus.
Am I right? If so, the kwargs might be renamed accordingly.
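
If that reading is right, the renamed options could look like the sketch below. All three option names are hypothetical suggestions; the script's real interface is whatever `tokenization_comparison.py` currently defines.

```python
import argparse

def build_parser():
    """Argument parser with names that reflect the gold/system/corpus roles."""
    parser = argparse.ArgumentParser(
        description="Compare two WordPiece tokenizers on a corpus.")
    # Hypothetical replacement for --vocab-file
    parser.add_argument("--gold-vocab", required=True,
                        help="vocabulary file of the gold tokenizer")
    # Hypothetical replacement for --model-dir
    parser.add_argument("--system-model-dir", required=True,
                        help="model directory of the system tokenizer")
    # Hypothetical replacement for --input-dir
    parser.add_argument("--corpus-dir", required=True,
                        help="directory containing the corpus")
    return parser

args = build_parser().parse_args(
    ["--gold-vocab", "vocab.txt",
     "--system-model-dir", "model/",
     "--corpus-dir", "corpus/"])
```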

Test multi-learning

I.e. integrating both NER and chunking into the same model. This would make the model size required for all three tasks much more manageable.

Wrong entity counts for BIOE1

When the input is in the BIOE1 format, the entity counts in the evaluation report are wrong; the number of entities is lower than it should be. Apparently the problem is in seqeval, which handles BIOES but not BIOE1.
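
A possible workaround, assuming BIOE1 differs from BIOES only in marking single-token entities with `1-` instead of `S-`, is to rewrite the labels before they reach the evaluation code. `bioe1_to_bioes` is a hypothetical helper, not an existing emBERT or seqeval function:

```python
def bioe1_to_bioes(tags):
    """Rewrite single-token tags "1-X" as "S-X"; other tags are unchanged."""
    return ["S-" + tag[2:] if tag.startswith("1-") else tag for tag in tags]

tags = ["B-NP", "I-NP", "E-NP", "O", "1-NP"]
converted = bioe1_to_bioes(tags)
```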

Support all model architectures

... not just BERT. This should not be too difficult now with AutoModels.

Update the README

With the full huBERT results.

Multi-GPU training is not working

In a multi-GPU environment (e.g. at lambda), the training stops with the following error:

Traceback (most recent call last):
  File "emBERT/scripts/train_embert.py", line 502, in <module>
    main()
  File "emBERT/scripts/train_embert.py", line 460, in main
    trainer.train()
  File "emBERT/scripts/train_embert.py", line 239, in train
    self.train_step(stats)
  File "emBERT/scripts/train_embert.py", line 260, in train_step
    label_ids, valid_ids, l_mask)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/emBERT/embert/model.py", line 24, in forward
    device=next(self.parameters()).device
StopIteration

self.parameters() seems to yield an empty iterator. The same setup runs flawlessly if only one GPU is used with CUDA_VISIBLE_DEVICES="1".
Did you manage to run it in such an environment? Do you have any idea what could be wrong and how to fix this error?

Viterbi doesn't work for BIO

The only valid transition from a B- label is to an I- label, which is correct for BIOES/1, but not for BIO, in which single-word entities are also marked with B-.
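
The difference between the two schemes can be sketched as a transition predicate. This is an illustration under the usual definitions of BIO and BIOES-style tagging, not the code in `viterbi.py`; the function name and the `scheme` parameter are hypothetical.

```python
def allowed(prev, nxt, scheme="BIO"):
    """True if label `nxt` may follow label `prev` under the given scheme."""
    def split(label):
        # "B-NP" -> ("B", "NP"); "O" -> ("O", None)
        return (label, None) if label == "O" else tuple(label.split("-", 1))

    p_tag, p_type = split(prev)
    n_tag, n_type = split(nxt)

    if scheme == "BIO":
        # Only I- is constrained: it must continue an entity of the same type.
        # B- may be followed by O or another B-, since it can mark a
        # single-word entity on its own.
        if n_tag == "I":
            return p_tag in ("B", "I") and p_type == n_type
        return True

    # BIOES / BIOE1: an entity opened by B- (or continued by I-) must be
    # continued by I- or closed by E- of the same type, and I-/E- may only
    # follow B-/I-.
    open_prev = p_tag in ("B", "I")
    cont_next = n_tag in ("I", "E")
    if open_prev or cont_next:
        return open_prev and cont_next and p_type == n_type
    return True
```

Under this sketch, `allowed("B-NP", "O", "BIO")` holds while `allowed("B-NP", "O", "BIOES")` does not, which is the distinction the issue describes.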

Multilingual performance

How does it change after fine-tuning? (E.g. English, but check out CamemBERT, etc. to see if there are benchmarks for other languages as well).

Look into tokenization

The original BERT was trained with raw text, and punctuation marks were generally seen attached to words. In emBERT, we take the output of emToken, so punctuation marks are tokens in their own right. This discrepancy might affect performance.

Check if this is really the case. The basic tokenization procedure does split punctuation from the end of words, so the problem might not be as acute as it seems at first sight.
Merge punctuation tokens with the words before sending them to the BERT model.
Alternatively, skip emToken altogether?
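
The merging option could be sketched as follows. `merge_punctuation` is a hypothetical helper; it keeps the indices of the original tokens so that per-token labels can be mapped back afterwards. It assumes ASCII punctuation only and attaches every punctuation token to the word on its left, which would be wrong for opening brackets or quotes.

```python
import string

def merge_punctuation(tokens):
    """Attach punctuation-only tokens to the preceding word.

    Returns the merged tokens and, for each merged token, the indices of
    the original tokens it covers.
    """
    merged, spans = [], []
    for i, tok in enumerate(tokens):
        is_punct = tok != "" and all(ch in string.punctuation for ch in tok)
        if is_punct and merged:
            # Glue the punctuation onto the previous token.
            merged[-1] += tok
            spans[-1].append(i)
        else:
            merged.append(tok)
            spans.append([i])
    return merged, spans

merged, spans = merge_punctuation(["Hello", ",", "world", "!"])
```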

RFC: Proposed packaging format for the models

As e-magyar (emtsv) is transitioning to Python packages instead of Git submodules and LFS, I've created a proposal for packaging emBERT models.

You can see the proposal here: https://github.com/dlt-rilmta/emBERT-models/tree/packaging

With a slight modification to emBERT, models could also be looked up as installed Python packages, so that the location of the actual files can be read from the model_dir attribute (e.g. embert.models.szeged_maxnp_bioes.model_dir).
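
As a rough illustration of how an installed package can report where its files live, the sketch below resolves a package's directory from its `__file__` attribute. `model_dir_of` is a hypothetical helper, and the standard-library `json` package stands in for a model package here; a model package could expose the same value as its `model_dir` attribute.

```python
import importlib
import os

def model_dir_of(package_name):
    """Directory of an installed package, found through its __file__ attribute."""
    module = importlib.import_module(package_name)
    return os.path.dirname(os.path.abspath(module.__file__))

# "json" is only a stand-in for an installed model package.
example_dir = model_dir_of("json")
```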

Packaging enables us to simplify the install process and separately version the models from the main module.

Please review it and share your thoughts. Also feel free to modify the code or merge.