CLUSE is an unsupervised learning framework for cross-lingual sense embeddings. Its goal is to provide the community with:
- state-of-the-art multilingual sense embeddings, aligned in a common space
- a large-scale, high-quality English-Chinese contextual similarity evaluation dataset (BCWS)
Requirements:
- Python 2.7/3.6 with NumPy/SciPy
- opencc-python-reimplemented
- zhon
- jieba
- nltk
- TensorFlow 1.10 with CUDA 9.0 and cuDNN v7.0.5
Get the training dataset for the English-German parallel corpus: Europarl.
Get the training dataset for the English-Chinese parallel corpus: UM-Corpus.
Get the monolingual sense embedding evaluation dataset: SCWS.
Get the cross-lingual sense embedding evaluation dataset: BCWS.
Please cite the corresponding papers if you use the above datasets.
All the data are in the data/ directory. You can download the preprocessed data from ftp://miulab.myds.me/CLUSE/data.zip; unzip it and replace the old data/ directory.
Alternatively, you can preprocess the data yourself.
First, put bcws.txt from BCWS into data/en_ch/.
python bi_make_sensplit_test.py (to produce bi_ratings.txt and the English (bcws_en.txt) and Chinese (bcws_zh.txt) texts)
Then put ratings.txt from SCWS into data/en_ch/ and data/en_de/.
python make_sensplit_test_general.py en_vocab ratings.txt scws_ratings.txt (to produce the SCWS ratings (scws_ratings.txt) and the English texts (scws_en.txt))
Since this work requires parallel corpora, you have to prepare two files for each language pair. These two files must have the same number of lines, so that the sentences with the same line number form a parallel sentence pair.
For example, to prepare the training and evaluation data for the English-German language pair:
cd data/en_de/
bash run.sh english_parallel german_parallel english_vocab_size german_vocab_size
To reproduce the results in the paper,
bash run.sh europarl-v7.de-en.en europarl-v7.de-en.de 6000 6000
will generate all the training and evaluation files.
Similarly,
cd data/en_ch/
bash run.sh en.txt ch.txt 6000 6000
will generate all the training and evaluation files for the English-Chinese language pair. Note that UM-Corpus covers several domains; we simply concatenate all the files.
To train the English-German sense embeddings:
docker build -t nvidia-miniconda3-cuda9-18.04 .
docker run --name crossling2 -it -v /home/ql261/crossling_contextualized_embed:/home/ql261/crossling_contextualized_embed nvidia-miniconda3-cuda9-18.04 /bin/bash
docker attach crossling2
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++11 -shared word2vec_ops.cc word2vec_kernels.cc -o word2vec_ops.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} -O2
cd en_de/
bash train.sh checkpoint_dir major_weight reg_weight
For example,
bash train.sh log 0.5 1.0
will train the model and save the checkpoint files to the log/ directory with the specified major weight and regularization weight. For details, please refer to the paper.
Similarly,
cd en_ch/
bash train.sh checkpoint_dir major_weight reg_weight
will train the model for the English-Chinese sense embeddings.
You will see the Spearman correlation scores on SCWS/BCWS during training.
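The Spearman score measures how well the model's similarity ranking of word pairs agrees with the human ratings in SCWS/BCWS. A self-contained sketch of the computation (the rating and score values below are hypothetical, purely for illustration):

```python
def spearman(xs, ys):
    """Spearman rank correlation between two score lists (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, 1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical human ratings and model cosine similarities for five word pairs.
human = [9.5, 7.0, 3.2, 1.0, 5.5]
model = [0.91, 0.75, 0.40, 0.05, 0.52]
print(spearman(human, model))  # 1.0 here, since the two rankings agree exactly
```

In practice `scipy.stats.spearmanr` does the same job (including tie handling); the hand-rolled version is only meant to show what the training log is reporting.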
To evaluate the trained models:
cd en_de/ or cd en_ch/
bash dump.sh path_to_ckpt
will evaluate on SCWS/BCWS again and dump the trained sense embeddings.
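Once dumped, the sense embeddings can be used for downstream similarity computations. A minimal NumPy sketch, assuming the vectors have already been loaded from the dump (the 4-dimensional vectors below are hypothetical; the actual dump format produced by dump.sh is not specified here):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two sense embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sense vectors for illustration only.
bank_sense0 = np.array([0.2, 0.9, 0.1, 0.0])  # e.g. the financial sense
bank_sense1 = np.array([0.8, 0.1, 0.0, 0.5])  # e.g. the river sense
money = np.array([0.1, 0.8, 0.2, 0.1])

# The financial sense should score higher against "money".
assert cosine(bank_sense0, money) > cosine(bank_sense1, money)
```

Because each word has multiple sense vectors, comparing a word to a query means picking (or averaging over) its senses rather than looking up a single row.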
To decode the sense for a specific word with its context,
cd en_de/ or cd en_ch/
bash decode.sh path_to_ckpt
Note that only English input is currently supported.
| Model | Bilingual Weight | Bilingual (BCWS) |
| --- | --- | --- |
| Luong et al. (2015) | - | 50.4 |
| Conneau et al. (2017) | - | 54.7 |
| CLUSE | 0.1 | 58.3 / 58.3 |
| | 0.3 | 58.8 / 58.8 |
| | 0.5 | 58.5 / 58.5 |
| | 0.7 | 58.3 / 58.4 |
| | 0.9 | 58.3 / 58.3 |
Please cite [1] if you find the resources in this repository useful, and cite [2] if you use the BCWS dataset.
[1] Ta-Chung Chi and Yun-Nung Chen, CLUSE: Cross-Lingual Unsupervised Sense Embeddings
@inproceedings{chi-chen:2018:EMNLP2018,
author = {Chi, Ta-Chung and Chen, Yun-Nung},
title = {CLUSE: Cross-Lingual Unsupervised Sense Embeddings},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
month = {October},
year = {2018},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
}
[2] Ta-Chung Chi, Ching-Yen Shih and Yun-Nung Chen, BCWS: Bilingual Contextual Word Similarity
@article{bcws,
title={BCWS: Bilingual Contextual Word Similarity},
author={Chi, Ta-Chung and Shih, Ching-Yen and Chen, Yun-Nung},
journal={arXiv preprint arXiv:},
year={2018}
}
This project is supported by Google Faculty Research Awards and MOST.