embert's People

Contributors

davidnemeskey, dependabot[bot]

Forkers

tamvar

embert's Issues

Use DistributedDataParallel

Under Python 3.9, I could not get the DataParallel-based training to work (version 1.2.0), so multi-GPU training has been removed from version 1.3.0. Reimplement this functionality properly with DistributedDataParallel.

Warnings and errors during prediction

FYI: If I try to run emBERT (ab4bbda and 2393b4a@emtsv) on MNSZ2 the following warnings and errors are generated:

/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
done
/app/embert/embert/viterbi.py:29: RuntimeWarning: divide by zero encountered in log
  self.init = fn(init, dtype=float)
/app/embert/embert/viterbi.py:31: RuntimeWarning: divide by zero encountered in log
  self.trans = fn(trans, dtype=float).T

Do I have to have CUDA for prediction?
These divide by zero warnings do not look good. They should be silenced if they are not meaningful.
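
If the zeros in the initial and transition probabilities are intentional (forbidden transitions), the warnings could be silenced locally rather than globally. A minimal sketch, assuming `fn` in `viterbi.py` is `np.log`; the helper name `log_probs` is made up for illustration and is not part of emBERT:

```python
import numpy as np


def log_probs(probs):
    """Element-wise natural log of a probability array.

    Zero probabilities become -inf; the divide-by-zero RuntimeWarning
    is suppressed only inside this function.
    """
    with np.errstate(divide="ignore"):
        return np.log(np.asarray(probs, dtype=float))
```

The `-inf` entries then behave as impossible transitions in a max-sum Viterbi, which is presumably the intent.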

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/xtsv/tsvhandler.py", line 67, in process
    yield from ('{0}\n'.format('\t'.join(tok)) for tok in internal_app.process_sentence(sen, field_values))
  File "/app/embert/embert/embert.py", line 99, in process_sentence
    classes = self.evaluator.predict().y_pred[0]
  File "/app/embert/embert/evaluate.py", line 73, in predict
    return self(True)
  File "/app/embert/embert/evaluate.py", line 97, in __call__
    seq_len = np.where(max_prob == self.sep_id)[0][0]
IndexError: index 0 is out of bounds for axis 0 with size 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/main.py", line 25, in <module>
    output_iterator.writelines(build_pipeline(input_data, used_tools, tools, presets, conll_comments))
  File "/usr/local/lib/python3.8/site-packages/xtsv/tsvhandler.py", line 70, in process
    raise type(e)('In "{0}" at {1}: {2}'.format(track_stream['file_name'], curr_line, str(e))).\
  File "/usr/local/lib/python3.8/site-packages/xtsv/tsvhandler.py", line 67, in process
    yield from ('{0}\n'.format('\t'.join(tok)) for tok in internal_app.process_sentence(sen, field_values))
  File "/app/embert/embert/embert.py", line 99, in process_sentence
    classes = self.evaluator.predict().y_pred[0]
  File "/app/embert/embert/evaluate.py", line 73, in predict
    return self(True)
  File "/app/embert/embert/evaluate.py", line 97, in __call__
    seq_len = np.where(max_prob == self.sep_id)[0][0]
IndexError: In "no filename for stream" at [LINE XXX]: index 0 is out of bounds for axis 0 with size 0

This stops the processing of the current file.

I don't mean to rush you; I'm just reporting the bugs and behaviour above.
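
The `IndexError` is raised because `np.where(...)[0]` is empty when no position equals `sep_id`, so indexing it with `[0]` fails. A hedged sketch of a guard follows; the function name is hypothetical, and falling back to the full sequence length when no separator is found is an assumption, not necessarily the right behaviour for emBERT:

```python
import numpy as np


def sequence_length(label_ids, sep_id):
    """Index of the first occurrence of sep_id in label_ids.

    Assumption: if sep_id never occurs, the whole sequence is used
    instead of raising IndexError.
    """
    hits = np.flatnonzero(np.asarray(label_ids) == sep_id)
    return int(hits[0]) if hits.size else len(label_ids)
```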

UTFA

Use The Fine API. transformers has improved a lot lately (it even has AutoModels); in particular, tokenizers have become more convenient to use. Look into which parts of the code can be replaced with now-standard API calls.

intended meaning of arguments in `tokenization_comparison.py` (question and suggestion)

Dávid, I'm creating this issue because I think it's easier to keep track of than an e-mail,
but if you don't like this, feel free to close it and continue some other way.
Thanks for the awesome repo!
In my understanding, tokenization_comparison compares two tokenizers (in the WordPiece sense) based on a corpus, so

  • --vocab-file is the gold tokenizer,
  • --model-dir is the "system" tokenizer, and
  • --input-dir is the corpus.

Am I right? If so, the kwargs might be renamed accordingly.
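
To make the suggestion concrete, here is a sketch of what the renaming could look like with `argparse`, keeping the current flags as aliases. The new flag names (`--gold-vocab`, `--system-model-dir`, `--corpus-dir`) are hypothetical, and the roles assigned to them are only my reading of the script, which has not been confirmed:

```python
import argparse


def build_parser():
    """Argument parser with more descriptive (hypothetical) flag names.

    The existing flags are kept as aliases so old command lines still work.
    """
    parser = argparse.ArgumentParser(
        description="Compare two WordPiece tokenizers on a corpus.")
    parser.add_argument("--gold-vocab", "--vocab-file", dest="gold_vocab",
                        required=True, help="vocabulary of the gold tokenizer")
    parser.add_argument("--system-model-dir", "--model-dir",
                        dest="system_model_dir", required=True,
                        help="model directory of the system tokenizer")
    parser.add_argument("--corpus-dir", "--input-dir", dest="corpus_dir",
                        required=True, help="directory containing the corpus")
    return parser
```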

Test multi-learning

I.e. integrating both NER and chunking into the same model. This would make the model size required for all three tasks much more manageable.

Wrong entity counts for BIOE1

When the input is in the BIOE1 format, the entity counts in the evaluation report are wrong; the number of entities is lower than it should be. Apparently the problem is in seqeval, which handles BIOES but not BIOE1.
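
One possible workaround is to relabel the sequence before it reaches the evaluation code. The sketch below assumes that the `1-` prefix in BIOE1 marks single-token entities, the way `S-` does in BIOES; the function names are hypothetical, and the counter is only a sanity check, not a replacement for the evaluation report:

```python
def bioe1_to_bioes(labels):
    """Rewrite BIOE1 labels to BIOES by turning the '1-' prefix into 'S-'."""
    return ["S-" + label[2:] if label.startswith("1-") else label
            for label in labels]


def count_entities(labels):
    """Count entities as the number of entity-opening labels (B- or S-)."""
    return sum(1 for label in bioe1_to_bioes(labels)
               if label[:2] in ("B-", "S-"))
```

For example, `["B-NP", "I-NP", "E-NP", "O", "1-NP", "1-NP"]` contains one multi-token and two single-token entities.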

Multi-GPU training is not working

In a multi-GPU environment (e.g. on lambda), training stops with the following error:

Traceback (most recent call last):
  File "emBERT/scripts/train_embert.py", line 502, in <module>
    main()
  File "emBERT/scripts/train_embert.py", line 460, in main
    trainer.train()
  File "emBERT/scripts/train_embert.py", line 239, in train
    self.train_step(stats)
  File "emBERT/scripts/train_embert.py", line 260, in train_step
    label_ids, valid_ids, l_mask)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/emBERT/embert/model.py", line 24, in forward
    device=next(self.parameters()).device
StopIteration

self.parameters() seems to yield an empty iterator. The same setup runs flawlessly if only one GPU is used, with CUDA_VISIBLE_DEVICES="1".
Did you manage to run it in such an environment? Do you have any idea what could be wrong and how to fix this error?
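
As far as I know, a common cause of this `StopIteration` is that, from PyTorch 1.5 on, the replicas created by `nn.DataParallel` no longer expose their weights through `parameters()`, so `next(self.parameters())` has nothing to return. A workaround that is often used is to read the device from an input tensor instead. The sketch below illustrates the idea on a toy module; `TinyTagger` and its layers are hypothetical stand-ins, not the model in `embert/model.py`:

```python
import torch
from torch import nn


class TinyTagger(nn.Module):
    """Toy token classifier used only to illustrate the device lookup."""

    def __init__(self, vocab_size=100, hidden=8, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids):
        # Take the device from the input rather than from
        # next(self.parameters()), which may be empty inside a replica.
        device = input_ids.device
        mask = torch.ones(input_ids.shape, dtype=torch.float, device=device)
        hidden = self.embed(input_ids) * mask.unsqueeze(-1)
        return self.classifier(hidden)
```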

Viterbi doesn't work for BIO

The only valid transition from a B- label is to an I- label, which is correct for BIOES/1, but not for BIO, in which single-word entities are also marked with B-.
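
For reference, the transition constraint under plain BIO can be written as a small predicate. This is a sketch of the rule only, with a hypothetical function name; it is not the code in `viterbi.py`:

```python
def is_valid_bio_transition(prev_label, next_label):
    """Return True if next_label may follow prev_label in the BIO scheme.

    An I-X label must continue an entity of the same type (B-X or I-X).
    Every other transition is allowed, including B-X -> O and B-X -> B-Y,
    because a single-word entity is just a lone B- label.
    """
    if next_label.startswith("I-"):
        return (prev_label[:2] in ("B-", "I-")
                and prev_label[2:] == next_label[2:])
    return True
```

A transition matrix for Viterbi could then be masked by evaluating this predicate for every label pair.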

Multilingual performance

How does it change after fine-tuning? (E.g. English, but check out CamemBERT, etc. to see if there are benchmarks for other languages as well).

Look into tokenization

The original BERT was trained with raw text, and punctuation marks were generally seen attached to words. In emBERT, we take the output of emToken, so punctuation marks are tokens in their own right. This discrepancy might affect performance.

  1. Check if this is really the case. The basic tokenization procedure does split punctuation from the end of words, so the problem might not be as acute as it seems at first sight.
  2. Merge punctuation tokens with the words before sending them to the BERT model.
  3. Alternatively, skip emToken altogether?
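
Option 2 could be prototyped along the following lines. This is a rough sketch under stated assumptions: the function names are hypothetical, only the Unicode categories Po, Pe and Pf are treated as trailing punctuation, and opening brackets are left as separate tokens. Since straight quotes are also in Po, an opening `"` would be attached to the preceding word, which a real implementation would need to handle.

```python
import unicodedata

# Assumption: categories treated as punctuation that attaches to the
# preceding word: Po (. , ! ?), Pe (closing brackets), Pf (closing quotes).
TRAILING_CATEGORIES = {"Po", "Pe", "Pf"}


def is_trailing_punct(token):
    """True if every character of the token is trailing punctuation."""
    return bool(token) and all(
        unicodedata.category(ch) in TRAILING_CATEGORIES for ch in token)


def merge_trailing_punct(tokens):
    """Attach trailing punctuation tokens to the word before them.

    Returns the merged tokens and, for each merged token, the indices of
    the original tokens it covers, so labels can be mapped back later.
    """
    merged, groups = [], []
    for index, token in enumerate(tokens):
        if merged and is_trailing_punct(token):
            merged[-1] += token
            groups[-1].append(index)
        else:
            merged.append(token)
            groups.append([index])
    return merged, groups
```

The index groups matter because the tagger's output still has to be reported per emToken token.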

RFC: Proposed packaging format for the models

As e-magyar (emtsv) is transitioning to Python packages instead of Git submodules and LFS, I've created a proposal for packaging the emBERT models.

You can see the proposal here: https://github.com/dlt-rilmta/emBERT-models/tree/packaging

With a slight modification to emBERT, models could also be looked up as installed Python packages, so that the location of the actual files can be extracted via the model_dir attribute (e.g. embert.models.szeged_maxnp_bioes.model_dir).
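
The lookup could be as small as the sketch below. It assumes that each model package exposes a module-level `model_dir` attribute, as in the proposal; the helper name `find_model_dir` is hypothetical:

```python
import importlib


def find_model_dir(package_name):
    """Return the model_dir of an installed model package, or None.

    None is returned if the package is not installed or if it does not
    define a model_dir attribute.
    """
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    return getattr(module, "model_dir", None)
```

emBERT could try this first and fall back to its current way of locating model files when the result is `None`.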

Packaging enables us to simplify the install process and to version the models separately from the main module.

Please review it and share your thoughts. Also feel free to modify the code or merge.
