We release our ELMo representations trained on many languages, which helped us win the CoNLL 2018 shared task on Universal Dependencies Parsing according to LAS.
We use the same hyperparameter settings as Peters et al. (2018) for the biLM and the character CNN. For each language, we train the parameters on a set of 20 million words randomly sampled from the raw text released by the shared task (wikidump + common crawl). We based our implementation largely on the AllenNLP code, but made the following changes:
- We support Unicode characters;
- We use the sampled softmax technique to make training on a large vocabulary feasible (Jean et al., 2015). However, we use a window of words surrounding the target word as negative samples, which showed better performance in our preliminary experiments (see the sketch after this list).
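For illustration, here is a minimal PyTorch sketch of this idea. All names are hypothetical, and the proposal-probability correction of full sampled softmax is omitted for brevity, so this is not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def window_sampled_softmax(hidden, target, negatives, out_embed, out_bias):
    """Softmax restricted to the gold word plus negatives drawn from a
    window around it, instead of the full vocabulary.

    hidden:    (batch, dim) biLM states predicting the next word
    target:    (batch,) gold word ids
    negatives: (batch, k) word ids sampled from a window around the target
    out_embed: (vocab, dim) output embedding matrix
    out_bias:  (vocab,) output bias
    """
    ids = torch.cat([target.unsqueeze(1), negatives], dim=1)  # (batch, 1 + k)
    w = out_embed[ids]                                        # (batch, 1 + k, dim)
    b = out_bias[ids]                                         # (batch, 1 + k)
    logits = torch.einsum('bd,bkd->bk', hidden, w) + b
    # the gold word always sits at index 0 of the restricted vocabulary
    gold = logits.new_zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, gold)
```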
The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.
The Simplified Chinese model was trained on the Xinhua portion of Gigaword v5.
Using the pre-trained models requires:
- Python 3.6
- PyTorch 0.4
- other requirements from AllenNLP
First, after unzipping the model, change the `config_path` field in `${lang}.model/config.json` to `${project_home}/configs/cnn_50_100_512_4096_sample.json`.
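For example, the field can be updated with a few lines of Python (the paths below are placeholders, not shipped files):

```python
import json

# hypothetical paths; substitute your own language code and project home
config_file = "zht.model/config.json"
with open(config_file) as f:
    config = json.load(f)
config["config_path"] = "/path/to/project_home/configs/cnn_50_100_512_4096_sample.json"
with open(config_file, "w") as f:
    json.dump(config, f, indent=2)
```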
Then, prepare your input file in the CoNLL-U format, for example:
1 Sue Sue _ _ _ _ _ _ _
2 likes like _ _ _ _ _ _ _
3 coffee coffee _ _ _ _ _ _ _
4 and and _ _ _ _ _ _ _
5 Bill Bill _ _ _ _ _ _ _
6 tea tea _ _ _ _ _ _ _
Fields should be separated by '\t'. We only use the second column, and a space (' ') is allowed in this field (a Vietnamese word can contain spaces). Do remember to tokenize your text first!
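If your text is already tokenized, a short script can produce this format. This is just a sketch; the helper name is ours, and it only mirrors the example above:

```python
def write_conll(sentences, path):
    """Write pre-tokenized sentences in the 10-column format above."""
    with open(path, "w", encoding="utf-8") as f:
        for words in sentences:
            for i, word in enumerate(words, 1):
                # only the index, form, and lemma columns are filled;
                # the remaining seven columns stay '_'
                f.write("\t".join([str(i), word, word] + ["_"] * 7) + "\n")
            f.write("\n")  # blank line between sentences

write_conll([["Sue", "likes", "coffee", "and", "Bill", "tea"]], "input.conllu")
```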
When it's all set, run:
python src/gen_elmo.py test \
--input_format conll \
--input /path/to/your/input \
--model /path/to/your/model \
--output_ave /path/to/your/output
It will dump an HDF5-encoded dict to disk, where each key is the '\t'-separated words of a sentence and the value is its 3-layer-averaged ELMo representation. You can also dump the first layer using the `--output_lstm` option.
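The dumped file can then be consumed with h5py, for example. This sketch assumes each value is a 2-D array of shape (sentence_length, dimension):

```python
import h5py

with h5py.File("/path/to/your/output", "r") as f:
    for key in f.keys():
        words = key.split("\t")   # the key is the '\t'-joined sentence
        vectors = f[key][...]     # averaged ELMo representation
        print(len(words), vectors.shape)
```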
We are actively changing the interface to bring it closer to the AllenNLP ELMo and to make it friendlier to use programmatically.
Please run
python src/biLM.py train -h
to get more details about ELMo training. However, we should note that the training process is not very stable: in some cases, we end up with a loss of NaN. We are actively working on this and hope to improve it in the future.
If our ELMo gives you nice improvements, please cite our paper:
- Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (CoNLL). (to appear)
- Bo Zheng <[email protected]>