CLUSE is an unsupervised learning framework for cross-lingual sense embeddings. Its goal is to provide the community with:
- state-of-the-art multilingual sense embeddings, aligned in a common space
- a large-scale, high-quality English-Chinese contextual similarity evaluation dataset (BCWS)
Requirements:
- Python 2.7/3.6 with NumPy/SciPy
- opencc-python-reimplemented
- zhon
- jieba
- nltk
- TensorFlow 1.10 with CUDA 9.0 and cuDNN v7.0.5
Get the training dataset for the English-German parallel corpus: Europarl.
Get the training dataset for the English-Chinese parallel corpus: UM-Corpus.
Get the monolingual sense embedding evaluation dataset: SCWS.
Get the cross-lingual sense embedding evaluation dataset: BCWS.
Please cite the corresponding papers if you use the above datasets.
All the data are in the data/ directory. You can download the preprocessed data from ftp://miulab.myds.me/CLUSE/data.zip; unzip it and replace the old data/ directory.
Alternatively, you can preprocess the data yourself.
First, put bcws.txt from BCWS into data/en_ch/.
python bi_make_sensplit_test.py (to produce bi_ratings.txt and the English (bcws_en.txt) and Chinese (bcws_zh.txt) texts)
Then put ratings.txt from SCWS into data/en_ch/ and data/en_de/.
python make_sensplit_test_general.py en_vocab ratings.txt scws_ratings.txt (to produce the SCWS ratings (scws_ratings.txt) and the English texts (scws_en.txt))
Since this work requires parallel corpora, you have to prepare two files for each language pair. These two files must have the same number of lines, so that the sentences with the same line number form a parallel sentence pair.
For example, to prepare the training and evaluation data for the English-German language pair:
cd data/en_de/
bash run.sh english_parallel german_parallel english_vocab_size german_vocab_size
To reproduce the results in the paper,
bash run.sh europarl-v7.de-en.en europarl-v7.de-en.de 6000 6000
will generate all the training and evaluation files.
Similarly,
cd data/en_ch/
bash run.sh en.txt ch.txt 6000 6000
will generate all the training and evaluation files for the English-Chinese language pair. Note that UM-Corpus covers several domains; we simply concatenate all the files.
To train the English-German sense embeddings:
docker build -t nvidia-miniconda3-cuda9-18.04 .
docker run --name crossling2 -it -v /home/ql261/crossling_contextualized_embed:/home/ql261/crossling_contextualized_embed nvidia-miniconda3-cuda9-18.04 /bin/bash
docker attach crossling2
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++11 -shared word2vec_ops.cc word2vec_kernels.cc -o word2vec_ops.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} -O2
cd en_de/
bash train.sh checkpoint_dir major_weight reg_weight
For example,
bash train.sh log 0.5 1.0
will train the model and save the checkpoint files to the log/ directory with the specified major weight and regularization weight. For details, please refer to the paper.
Similarly,
cd en_ch/
bash train.sh checkpoint_dir major_weight reg_weight
will train the model for the English-Chinese sense embeddings.
You will see the Spearman correlation scores on SCWS/BCWS during training.
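The Spearman score measures how well the model's similarity ranking of word pairs agrees with the human ratings in SCWS/BCWS. A self-contained sketch of the computation (the rating and score values below are hypothetical, purely for illustration):

```python
def spearman(xs, ys):
    """Spearman rank correlation between two score lists (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, 1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical human ratings and model cosine similarities for five word pairs.
human = [9.5, 7.0, 3.2, 1.0, 5.5]
model = [0.91, 0.75, 0.40, 0.05, 0.52]
print(spearman(human, model))  # 1.0 here, since the two rankings agree exactly
```

In practice `scipy.stats.spearmanr` does the same job (including tie handling); the hand-rolled version is only meant to show what the training log is reporting.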
To evaluate the trained models:
cd en_de/ or cd en_ch/
bash dump.sh path_to_ckpt
will evaluate on SCWS/BCWS again and dump the trained sense embeddings.
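Once dumped, the sense embeddings can be used for downstream similarity computations. A minimal NumPy sketch, assuming the vectors have already been loaded from the dump (the 4-dimensional vectors below are hypothetical; the actual dump format produced by dump.sh is not specified here):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two sense embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sense vectors for illustration only.
bank_sense0 = np.array([0.2, 0.9, 0.1, 0.0])  # e.g. the financial sense
bank_sense1 = np.array([0.8, 0.1, 0.0, 0.5])  # e.g. the river sense
money = np.array([0.1, 0.8, 0.2, 0.1])

# The financial sense should score higher against "money".
assert cosine(bank_sense0, money) > cosine(bank_sense1, money)
```

Because each word has multiple sense vectors, comparing a word to a query means picking (or averaging over) its senses rather than looking up a single row.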
To decode the sense for a specific word with its context,
cd en_de/ or cd en_ch/
bash decode.sh path_to_ckpt
Note that only English input is currently supported.
| Model | Bilingual Weight | Bilingual (BCWS) |
| --- | --- | --- |
| Luong et al. (2015) | - | 50.4 |
| Conneau et al. (2017) | - | 54.7 |
| CLUSE | 0.1 | 58.3 / 58.3 |
| | 0.3 | 58.8 / 58.8 |
| | 0.5 | 58.5 / 58.5 |
| | 0.7 | 58.3 / 58.4 |
| | 0.9 | 58.3 / 58.3 |
Please cite [1] if you find the resources in this repository useful, and cite [2] if you use the BCWS dataset.
[1] Ta-Chung Chi and Yun-Nung Chen, CLUSE: Cross-Lingual Unsupervised Sense Embeddings
@inproceedings{chi-chen:2018:EMNLP2018,
author = {Chi, Ta-Chung and Chen, Yun-Nung},
title = {CLUSE: Cross-Lingual Unsupervised Sense Embeddings},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
month = {October},
year = {2018},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
}
[2] Ta-Chung Chi, Ching-Yen Shih and Yun-Nung Chen, BCWS: Bilingual Contextual Word Similarity
@article{bcws,
title={BCWS: Bilingual Contextual Word Similarity},
author={Chi, Ta-Chung and Shih, Ching-Yen and Chen, Yun-Nung},
journal={arXiv preprint arXiv:},
year={2018}
}
This project is supported by Google Faculty Research Awards and MOST.