CLUSE: Cross-Lingual Unsupervised Sense Embeddings

Model

CLUSE is an unsupervised learning framework for cross-lingual sense embeddings. It aims to provide the community with:

  • state-of-the-art multilingual sense embeddings that are aligned in a common space
  • a large-scale, high-quality English-Chinese contextual similarity evaluation dataset

Dependencies

Get training & evaluation datasets

Get training dataset for English-German parallel corpus: Europarl.

Get training dataset for English-Chinese parallel corpus: UM-Corpus.

Get mono-lingual sense embeddings evaluation dataset: SCWS.

Get cross-lingual sense embeddings evaluation dataset: BCWS.

Please cite the corresponding papers if you use the above datasets.

Data preprocessing

All the data are in the data/ directory. You can safely download the preprocessed data from ftp://miulab.myds.me/CLUSE/data.zip. Unzip it and replace the old data/ directory.

Alternatively, you can preprocess the data yourself.

First put the bcws.txt from BCWS into data/en_ch/.

python bi_make_sensplit_test.py

This produces bi_ratings.txt, the English text (bcws_en.txt), and the Chinese text (bcws_zh.txt).

Then put the ratings.txt from SCWS into data/en_ch/ and data/en_de/.

python make_sensplit_test_general.py en_vocab ratings.txt scws_ratings.txt

This produces the SCWS ratings and the English text (scws_en.txt).

Since this work requires a parallel corpus, you have to prepare two files for each language pair. The two files must have the same number of lines, so that the sentences on the same line number form a parallel sentence pair.
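Before running the preprocessing scripts, it can be worth confirming that the two files really are line-aligned. The following is a minimal sketch and not part of this repository: the function name `count_parallel_pairs` and the toy file names are hypothetical, and UTF-8 encoding is assumed.

```python
import os
import tempfile
from itertools import zip_longest


def count_parallel_pairs(src_path, tgt_path):
    """Return the number of sentence pairs; raise if one file is shorter."""
    pairs = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which exposes a mismatch.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                shorter = src_path if s is None else tgt_path
                raise ValueError(
                    f"{shorter} has fewer lines than its counterpart "
                    f"(ends before line {line_no})"
                )
            pairs += 1
    return pairs


# Toy demonstration with throwaway files (hypothetical names and content).
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "toy.en")
    de_path = os.path.join(tmp, "toy.de")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("good morning\nthank you\n")
    with open(de_path, "w", encoding="utf-8") as f:
        f.write("guten Morgen\ndanke\n")
    n_pairs = count_parallel_pairs(en_path, de_path)
    print(n_pairs)
```

The same function could be pointed at the two corpus files passed to run.sh; it reads both files lazily, so memory use stays small even for large corpora.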

For example, to prepare the training and evaluation data for the English-German language pair,

cd data/en_de/
bash run.sh english_parallel german_parallel english_vocab_size german_vocab_size

To reproduce the results in the paper,

bash run.sh europarl-v7.de-en.en europarl-v7.de-en.de 6000 6000

will generate all the training and evaluation files.

Similarly,

cd data/en_ch/
bash run.sh en.txt ch.txt 6000 6000

will generate all the training and evaluation files for the English-Chinese language pair. Note that UM-Corpus covers several domains; we simply concatenate all of its files.

Training

To train the English-German sense embeddings:

docker build -t nvidia-miniconda3-cuda9-18.04 .
docker run  --name crossling2 -it -v /home/ql261/crossling_contextualized_embed:/home/ql261/crossling_contextualized_embed nvidia-miniconda3-cuda9-18.04 /bin/bash
docker attach crossling2
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++11 -shared word2vec_ops.cc word2vec_kernels.cc -o word2vec_ops.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} -O2

cd en_de/
bash train.sh checkpoint_dir major_weight reg_weight

For example,

bash train.sh log 0.5 1.0

will train the model with the specified major weight and regularization weight, and save the checkpoint files to the log directory. For details, please refer to the paper.

Similarly,

cd en_ch/
bash train.sh checkpoint_dir major_weight reg_weight

will train the model for the English-Chinese sense embeddings.

Evaluation

The Spearman correlation scores on SCWS/BCWS are printed during training.
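For readers unfamiliar with the metric, Spearman correlation is the Pearson correlation computed on ranks, here between human similarity ratings and model similarity scores. The sketch below is a standalone illustration in pure Python, not the evaluation code used by this repository; the function names and the toy numbers are made up for the example.

```python
def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): human ratings vs. model cosine similarities.
human_ratings = [9.1, 2.0, 6.5, 4.0]
model_scores = [0.83, 0.10, 0.40, 0.55]
rho = spearman(human_ratings, model_scores)
print(round(rho, 3))
```

Only the ordering of the scores matters, so the metric is insensitive to the scale of the model's similarity values.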

To evaluate the trained models:

cd en_de/   # or: cd en_ch/
bash dump.sh path_to_ckpt

will evaluate the SCWS/BCWS again and dump the trained sense embeddings.

To decode the sense for a specific word with its context,

cd en_de/   # or: cd en_ch/
bash decode.sh path_to_ckpt

Note that only English input is currently supported.

Results (AvgSimC / MaxSimC)

Model                   Bilingual Weight   Bilingual (BCWS)
Luong et al. (2015)     -                  50.4
Conneau et al. (2017)   -                  54.7
CLUSE                   0.1                58.3 / 58.3
CLUSE                   0.3                58.8 / 58.8
CLUSE                   0.5                58.5 / 58.5
CLUSE                   0.7                58.3 / 58.4
CLUSE                   0.9                58.3 / 58.3
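As a reminder of what the two metrics measure, the sketch below follows the definitions commonly used in the multi-sense embedding literature: AvgSimC weights the similarity of every sense pair by the probability of each sense given its context, while MaxSimC compares only the most probable sense of each word. This is an illustration under those assumptions, not the repository's implementation; the function names, the use of cosine similarity, the absence of any extra normalisation, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def avg_sim_c(senses_a, probs_a, senses_b, probs_b):
    """Probability-weighted similarity over all sense pairs."""
    return sum(
        pa * pb * cosine(sa, sb)
        for sa, pa in zip(senses_a, probs_a)
        for sb, pb in zip(senses_b, probs_b)
    )


def max_sim_c(senses_a, probs_a, senses_b, probs_b):
    """Similarity between the most probable sense of each word."""
    best_a = max(range(len(probs_a)), key=lambda i: probs_a[i])
    best_b = max(range(len(probs_b)), key=lambda j: probs_b[j])
    return cosine(senses_a[best_a], senses_b[best_b])


# Toy example (hypothetical): two words with two senses each.
senses_a = [[1.0, 0.0], [0.0, 1.0]]
probs_a = [0.9, 0.1]   # sense probabilities of word A in its context
senses_b = [[1.0, 0.0], [0.0, 1.0]]
probs_b = [0.2, 0.8]   # sense probabilities of word B in its context

avg_score = avg_sim_c(senses_a, probs_a, senses_b, probs_b)
max_score = max_sim_c(senses_a, probs_a, senses_b, probs_b)
print(avg_score, max_score)
```

In this toy case the most probable senses are orthogonal, so MaxSimC is zero while AvgSimC still credits the less likely matching senses.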

References

Please cite [1] if you find the resources in this repository useful, and cite [2] if you use the BCWS dataset.

CLUSE: Cross-Lingual Unsupervised Sense Embeddings

[1] Ta-Chung Chi and Yun-Nung Chen, CLUSE: Cross-Lingual Unsupervised Sense Embeddings

@inproceedings{chi-chen:2018:EMNLP2018,
  author    = {Chi, Ta-Chung  and  Chen, Yun-Nung},
  title     = {{CLUSE}: Cross-Lingual Unsupervised Sense Embeddings},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
}

BCWS: Bilingual Contextual Word Similarity

[2] Ta-Chung Chi, Ching-Yen Shih and Yun-Nung Chen, BCWS: Bilingual Contextual Word Similarity

@article{bcws,
  title={BCWS: Bilingual Contextual Word Similarity},
  author={Chi, Ta-Chung and Shih, Ching-Yen and Chen, Yun-Nung},
  journal={arXiv preprint arXiv:},
  year={2018}
}

Acknowledgement

This project is supported by Google Faculty Research Awards and MOST.
