Pre-trained ELMo Representations for Many Languages

We release our ELMo representations trained on many languages, which helped us win the CoNLL 2018 shared task on Universal Dependencies Parsing according to LAS.

Technical Details

We use the same hyperparameter settings as Peters et al. (2018) for the biLM and the character CNN. For each language, we train their parameters on 20 million words randomly sampled from the raw text released by the shared task (wikidump + common crawl). Our code is largely based on AllenNLP, with the following changes:

  • We support Unicode characters;
  • We use the sampled softmax technique to make training with a large vocabulary feasible (Jean et al., 2015). However, we use a window of words surrounding the target word as negative samples, which performed better in our preliminary experiments.
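
To make the second change concrete, here is a minimal sketch of a softmax loss that normalises over the target word and its window neighbours only, instead of the full vocabulary. It is not the code used in this repository: the function name, the `scores` dictionary, the default window size, and the choice to drop repeated occurrences of the target from the negatives are all assumptions made for illustration.

```python
import math


def window_sampled_softmax_loss(scores, sentence, position, window=2):
    """Cross-entropy over the target word plus its window neighbours.

    scores   -- hypothetical dict: word -> unnormalised score at this position
    sentence -- list of tokens
    position -- index of the target word in `sentence`
    window   -- number of neighbours taken on each side (assumed value)
    """
    target = sentence[position]
    lo = max(0, position - window)
    hi = min(len(sentence), position + window + 1)
    # Negative samples: distinct words inside the window, excluding the target.
    negatives = {w for w in sentence[lo:hi] if w != target}
    candidates = [target] + sorted(negatives)
    logits = [scores[w] for w in candidates]
    # Numerically stable log-sum-exp over the small candidate set.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of the target among the candidates.
    return log_z - logits[0]
```

The cost of the normaliser grows with the window size rather than with the vocabulary size, which is the point of sampling.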

Training ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.

Downloads

Arabic Bulgarian Catalan Czech
Old Church Slavonic Danish German Greek
English Spanish Estonian Basque
Persian Finnish French Irish
Galician Ancient_Greek Hebrew Hindi
Croatian Hungarian Indonesian Italian
Japanese Korean Latin Latvian
Dutch Norwegian-Bokmaal Norwegian-Nynorsk Polish
Portuguese Romanian Russian Slovak
Slovenian Swedish Turkish Uyghur
Ukrainian Urdu Vietnamese Traditional-Chinese
Simplified-Chinese

The Simplified Chinese model was trained on the Xinhua portion of Gigaword v5.

Prerequisites

  • Python 3.6
  • PyTorch 0.4
  • other requirements from AllenNLP

Usage

First, after unzipping the model, change the "config_path" field in ${lang}.model/config.json to ${project_home}/configs/cnn_50_100_512_4096_sample.json.
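
If you prefer to script this step, the following sketch rewrites the field with the standard json module. The function name and its two arguments are placeholders introduced here; it assumes config.json is a flat JSON object that contains a "config_path" key, as described above.

```python
import json
from pathlib import Path


def set_config_path(model_dir, project_home):
    """Point config.json in `model_dir` at the CNN config shipped with the project."""
    config_file = Path(model_dir) / "config.json"
    config = json.loads(config_file.read_text(encoding="utf-8"))
    # Only this field is changed; every other entry is written back untouched.
    config["config_path"] = str(
        Path(project_home) / "configs" / "cnn_50_100_512_4096_sample.json"
    )
    config_file.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config["config_path"]
```

Call it with the unzipped ${lang}.model directory and your ${project_home}; it returns the path that was written.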

Then prepare your input file in the CoNLL-U format, for example:

1   Sue    Sue    _   _   _   _   _   _   _
2   likes  like   _   _   _   _   _   _   _
3   coffee coffee _   _   _   _   _   _   _
4   and    and    _   _   _   _   _   _   _
5   Bill   Bill   _   _   _   _   _   _   _
6   tea    tea    _   _   _   _   _   _   _

Fields should be separated by '\t'. We only use the second column, and a space (' ') is allowed in this field (in Vietnamese, for example, a word can contain spaces). Remember to tokenize your text first!
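
As an illustration, already-tokenized sentences can be turned into this layout with a few lines of Python. This is a sketch, not part of the project: `to_conllu` is a hypothetical helper, and it fills every column except the index and the word with "_", which is sufficient here because only the second column is read.

```python
def to_conllu(sentences):
    """Render tokenized sentences as tab-separated, 10-column CoNLL-U text.

    sentences -- iterable of token lists, e.g. [["Sue", "likes", "coffee"]]
    """
    blocks = []
    for words in sentences:
        lines = []
        for index, word in enumerate(words, start=1):
            # Column 1: token index, column 2: word form, columns 3-10: "_".
            lines.append("\t".join([str(index), word] + ["_"] * 8))
        blocks.append("\n".join(lines))
    # CoNLL-U ends every sentence, including the last one, with a blank line.
    return "\n\n".join(blocks) + "\n\n"
```

Write the returned string to a UTF-8 file and pass that file to --input.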

When it's all set, run

python src/gen_elmo.py test \
    --input_format conll \
    --input /path/to/your/input \
    --model /path/to/your/model \
    --output_ave /path/to/your/output

It will dump an hdf5-encoded dict to disk, where each key is the '\t'-separated words of a sentence and the value is its 3-layer averaged ELMo representation. You can also dump the first layer using the --output_lstm option. We are actively changing the interface to align it with the AllenNLP ELMo and to make it easier to use programmatically.
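
A minimal sketch of reading one sentence back, assuming the third-party h5py package is installed and that the key is exactly the tab-joined words as described above. Both function names are hypothetical. Because HDF5 treats '/' as a path separator, a sentence containing that character may be stored under a different key, so check the keys in your own file if a lookup fails.

```python
def sentence_key(words):
    """Build the lookup key for a sentence: its words joined by tabs."""
    return "\t".join(words)


def load_sentence_embedding(hdf5_path, words):
    """Return the stored array for one sentence from the dumped file."""
    import h5py  # third-party; imported here so sentence_key works without it

    with h5py.File(hdf5_path, "r") as dump:
        # [()] reads the whole dataset into memory as a NumPy array.
        return dump[sentence_key(words)][()]
```

For the example input above, the key would be built from ["Sue", "likes", "coffee", "and", "Bill", "tea"].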

Training Your Own ELMo

Please run

python src/biLM.py train -h

to get more details about ELMo training. Note, however, that the training process is not very stable: in some cases, we end up with a loss of NaN. We are actively working on this and hope to improve it in the future.

Citation

If our ELMo gave you nice improvements, please cite us.

  • Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation. (to appear) In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (CoNLL).
