
Chinese Word Vectors 中文词向量


This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

Reference

Please cite the following paper if you use these embeddings or the CA8 dataset.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.

@InProceedings{P18-2023,
  author =  "Li, Shen
    and Zhao, Zhe
    and Hu, Renfen
    and Li, Wensi
    and Liu, Tao
    and Du, Xiaoyong",
  title =   "Analogical Reasoning on Chinese Morphological and Semantic Relations",
  booktitle =   "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  year =  "2018",
  publisher =   "Association for Computational Linguistics",
  pages =   "138--143",
  location =  "Melbourne, Australia",
  url =   "http://aclweb.org/anthology/P18-2023"
}

 

A detailed analysis of the relation between the intrinsic and extrinsic evaluations of Chinese word embeddings is presented in the following paper:

Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang. Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings. Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. (CCL & NLP-NABD 2018 Best Paper)

@incollection{qiu2018revisiting,
  title={Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings},
  author={Qiu, Yuanyuan and Li, Hongzheng and Li, Shen and Jiang, Yingdi and Hu, Renfen and Yang, Lijiao},
  booktitle={Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
  pages={209--221},
  year={2018},
  publisher={Springer}
}

Format

The pre-trained vector files are in text format. Each line contains a word and its vector, with values separated by spaces. The first line records the meta information: the first number is the number of words in the file and the second is the dimension size.

Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They are in the same format as liblinear, where the number before ":" denotes the dimension index and the number after it denotes the value.
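For illustration, the following minimal Python sketch reads the dense format and parses a single sparse (liblinear-style) line. It is not part of the released toolkit; the file name is a placeholder, and errors="ignore" is only a pragmatic workaround for occasional non-UTF-8 bytes.

    # Minimal sketch, not part of the released toolkit.
    def load_dense(path):
        """Read a dense vector file: '<vocab_size> <dim>' header, then '<word> v1 ... vdim'."""
        vectors = {}
        with open(path, encoding="utf-8", errors="ignore") as f:
            vocab_size, dim = map(int, f.readline().split())
            for line in f:
                parts = line.rstrip().split(" ")
                if len(parts) != dim + 1:   # skip malformed lines
                    continue
                vectors[parts[0]] = [float(v) for v in parts[1:]]
        return vectors, dim

    def parse_sparse_line(line):
        """Parse one sparse (PPMI) entry: '<word> index:value index:value ...'."""
        parts = line.rstrip().split(" ")
        word, feats = parts[0], parts[1:]
        return word, {int(i): float(v) for i, v in (feat.split(":") for feat in feats)}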

Pre-trained Chinese Word Vectors

Basic Settings

                                       
| Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word | Iteration | Negative Sampling* |
| --- | --- | --- | --- | --- | --- |
| 5 | Yes | 1e-5 | 10 | 5 | 5 |

*Only for SGNS.

Various Domains

Chinese Word Vectors trained with different representations, context features, and corpora.

Word2vec / Skip-Gram with Negative Sampling (SGNS)
| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d / PWD: 5555 |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d / PWD: z5b4 | 300d | 300d / PWD: yenb |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 (Baidu Netdisk / Google Drive) | 300d / 300d | 300d / 300d | 300d / 300d | 300d / 300d |
Positive Pointwise Mutual Information (PPMI)
| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | Sparse | Sparse | Sparse | Sparse |
| Wikipedia_zh 中文维基百科 | Sparse | Sparse | Sparse | Sparse |
| People's Daily News 人民日报 | Sparse | Sparse | Sparse | Sparse |
| Sogou News 搜狗新闻 | Sparse | Sparse | Sparse | Sparse |
| Financial News 金融新闻 | Sparse | Sparse | Sparse | Sparse |
| Zhihu_QA 知乎问答 | Sparse | Sparse | Sparse | Sparse |
| Weibo 微博 | Sparse | Sparse | Sparse | Sparse |
| Literature 文学作品 | Sparse | Sparse | Sparse | Sparse |
| Complete Library in Four Sections 四库全书* | Sparse | Sparse | NAN | NAN |
| Mixed-large 综合 | Sparse | Sparse | Sparse | Sparse |

*Character embeddings are provided, since most Hanzi were themselves words in archaic Chinese.

Various Co-occurrence Information

We release word vectors trained on different co-occurrence statistics. Target and context vectors are often called input and output vectors in related papers.

In this part, one can obtain vectors of arbitrary linguistic units beyond words. For example, character vectors are found in the context vectors of word → character.

All vectors are trained by SGNS on Baidu Encyclopedia.

                                                       
| Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
| --- | --- | --- | --- |
| Word | Word → Word | 300d | 300d |
| Ngram | Word → Ngram (1-2) | 300d | 300d |
| | Word → Ngram (1-3) | 300d | 300d |
| | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
| Character | Word → Character (1) | 300d | 300d |
| | Word → Character (1-2) | 300d | 300d |
| | Word → Character (1-4) | 300d | 300d |
| Radical | Radical | 300d | 300d |
| Position | Word → Word (left/right) | 300d | 300d |
| | Word → Word (distance) | 300d | 300d |
| Global | Word → Text | 300d | 300d |
| Syntactic Feature | Word → POS | 300d | 300d |
| | Word → Dependency | 300d | 300d |

Representations

Existing word representation methods fall into one of two classes: dense and sparse representations. The SGNS model (a model in the word2vec toolkit) and the PPMI model are typical methods of these two classes, respectively. The SGNS model trains low-dimensional real-valued (dense) vectors through a shallow neural network and is also known as a neural embedding method. The PPMI model is a sparse bag-of-features representation weighted by the positive pointwise mutual information (PPMI) weighting scheme.
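To make the PPMI weighting concrete, here is a small NumPy sketch that turns a word-context co-occurrence count matrix into a PPMI matrix. This is a generic illustration of the weighting scheme, not the code used to build the released sparse vectors.

    import numpy as np

    def ppmi(counts):
        """counts: (n_words, n_contexts) matrix of co-occurrence counts."""
        total = counts.sum()
        p_wc = counts / total                             # joint P(w, c)
        p_w = counts.sum(axis=1, keepdims=True) / total   # marginal P(w)
        p_c = counts.sum(axis=0, keepdims=True) / total   # marginal P(c)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
        pmi[~np.isfinite(pmi)] = 0.0                      # zero counts contribute nothing
        return np.maximum(pmi, 0.0)                       # keep only positive PMI

    counts = np.array([[10.0, 0.0, 2.0],
                       [3.0, 5.0, 0.0]])
    print(ppmi(counts))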

Context Features

Three context features are commonly used in the word embedding literature: word, ngram, and character. Most word representation methods essentially exploit word-word co-occurrence statistics, that is, they use the word as the context feature (word feature). Inspired by the language modeling problem, we introduce the ngram feature into the context, so both word-word and word-ngram co-occurrence statistics are used for training (ngram feature). For Chinese, characters (Hanzi) often convey strong semantics, so we also use word-word and word-character co-occurrence statistics for learning word vectors. The length of character-level ngrams ranges from 1 to 4 (character feature).
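As an illustration of the character feature, the snippet below lists the character-level ngrams (lengths 1 to 4) of a segmented word. How these ngrams are paired with target words inside the training toolkit is not shown here; treat this as a sketch of the feature itself.

    def char_ngrams(word, min_n=1, max_n=4):
        """Character-level ngrams of a word, with lengths from min_n to max_n."""
        grams = []
        for n in range(min_n, max_n + 1):
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
        return grams

    print(char_ngrams("自然语言"))
    # ['自', '然', '语', '言', '自然', '然语', '语言', '自然语', '然语言', '自然语言']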

Besides word, ngram, and character, there are other features that have a substantial influence on the properties of word vectors. For example, using the entire text as the context feature can introduce more topic information into word vectors, while using dependency parses as the context feature can add syntactic constraints. 17 co-occurrence types are considered in this project.

Corpus

We made great efforts to collect corpora across various domains. All text data are preprocessed by removing HTML and XML tags; only the plain text is kept, and HanLP (v1.5.3) is used for word segmentation. In addition, traditional Chinese characters are converted into simplified characters with Open Chinese Convert (OpenCC); a small conversion sketch follows the corpus table below. The detailed corpus information is listed as follows:

| Corpus | Size | Tokens | Vocabulary Size | Description |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017), http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou Labs, http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data provided by the NLPIR Lab, http://www.nlpir.org/wordpress/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literary works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | Large corpus built by merging the above corpora |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China |

All words are kept, including low-frequency words.
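A minimal sketch of the traditional-to-simplified conversion step mentioned above, assuming the opencc-python-reimplemented package; the project itself may have used the OpenCC command-line tool or other bindings.

    from opencc import OpenCC   # pip install opencc-python-reimplemented (an assumption)

    cc = OpenCC("t2s")          # traditional Chinese -> simplified Chinese
    print(cc.convert("自然語言處理"))   # prints: 自然语言处理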

Toolkits

All word vectors are trained with the ngram2vec toolkit, a superset of the word2vec and fasttext toolkits that supports arbitrary context features and models.

Chinese Word Analogy Benchmarks

The quality of word vectors is often evaluated with analogy question tasks. In this project, two benchmarks are used for evaluation. The first is CA-translated, where most analogy questions are directly translated from English benchmarks. Although CA-translated has been widely used in many Chinese word embedding papers, it only contains three types of semantic questions and covers 134 Chinese words. In contrast, CA8 is specifically designed for the Chinese language. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations. CA-translated, CA8, and their detailed descriptions are provided in the testsets folder.

Evaluation Toolkit

We present an evaluation toolkit in the evaluation folder.

Run the following commands to evaluate dense vectors:

$ python ana_eval_dense.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_dense.py -v <vector.txt> -a CA8/semantic.txt

Run the following commands to evaluate sparse vectors:

$ python ana_eval_sparse.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_sparse.py -v <vector.txt> -a CA8/semantic.txt
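The dense files can also be loaded directly with gensim and queried with the standard vector-offset (3CosAdd) analogy method. This is a usage sketch with a placeholder file name, not part of the bundled evaluation scripts, and unicode_errors="ignore" is only a workaround for occasional non-UTF-8 bytes.

    from gensim.models import KeyedVectors

    # Load a released dense file (word2vec text format); the path is a placeholder.
    wv = KeyedVectors.load_word2vec_format("sgns.merge.word", binary=False,
                                           unicode_errors="ignore")

    # Vector-offset analogy: vec("国王") - vec("男人") + vec("女人") ≈ vec(?)
    print(wv.most_similar(positive=["国王", "女人"], negative=["男人"], topn=5))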


chinese-word-vectors's Issues

Baidu Encyclopedia word embedding file has some missing spaces

Affected file:
Word2vec / Skip-Gram with Negative Sampling (SGNS)

Baidu Encyclopedia 百度百科, Word

The affected lines are:
line 269598
line 334166
line 340101
line 386099
line 387913
line 398991
line 403440
line 417792
line 440725
line 510420
line 518270
line 628803

On each of these lines, the word and the first number are run together with no space between them.
Please check.

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: invalid continuation byte

I am using the sgns.merge.word vectors with Python 3.
I tried the following two approaches and neither works.

    f = open(filename,'r')
    line = f.readline().strip()
    word_dim = int(line.split(' ')[1])
    for line in f:
        row = line.strip().split(' ')
        vocab.append(row[0])
        embd.append(row[1:])
    f.close()

Using f.readlines() gives the same error.
Error:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 3472-3473: invalid continuation byte

    with open(filename, 'rb') as f:
        line = f.readline().decode('utf-8').strip()
        word_dim = int(line.split(' ')[1])
        for line in f:
            row = line.decode('utf-8').strip().split(' ')
            vocab.append(row[0])
            embd.append(row[1:])

Error:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: invalid continuation byte

About corpus preprocessing

Thanks to the authors for providing the word vectors. Before training word2vec, which HanLP segmentation algorithm did you use, HMM or CRF, and does the choice have a large effect on the trained vectors? Also, did you use a user-defined dictionary during segmentation? And after segmentation, did you filter out stop words and punctuation? Many thanks.

can't use word2vec.load()?

I downloaded the vectors (People's Daily News 人民日报, Word + Character + Ngram) and used bzip2 to decompress the file.
I wanted to use word2vec.load(), so I renamed the file sgns.renmin.bigram-char to sgns_renmin_bigram_char.
The error is:
ValueError: could not broadcast input array from shape (299) into shape (300)

Vocabulary size does not match the number of words in the embedding

I downloaded the word+char model trained on the Wikipedia corpus. After loading it with gensim, the reported vocabulary size is 352281, but the vocabulary size in the README is 2129K.
As I understand it, the two should be the same. Is my understanding wrong, or is the data wrong?

Python gensim cannot load the vector file

D:\Program\Anaconda3\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
File ".\zzk_word2vec.py", line 101, in
test_word_embedding('D:\data\pretrain_word2vec\Chinese-Word-Vectors\sgns.zhihu.char\sgns.zhihu.char')
File ".\zzk_word2vec.py", line 76, in test_word_embedding
model = gensim.models.KeyedVectors.load_word2vec_format(vector_file, binary=False, encoding='utf8')
File "D:\Program\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 250, in load_word2vec_format
parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
File "D:\Program\Anaconda3\lib\site-packages\gensim\utils.py", line 242, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: invalid continuation byte

Question about Word+Character+Ngram

I have a question.
What is the difference between the Word, Word+Ngram, Word+Character, and Word+Character+Ngram context features?
Word vectors are the hidden-layer weight matrix obtained by training on word pairs inside the context window.
How is Word+Ngram trained? If the ngram model is essentially a probability table, how is it folded into word vector training? I have the same question about +Character.
I do not understand the differences between these settings; could you explain? Thanks.

Downloaded vectors cannot be decompressed

I downloaded sgns.merge.word.bz2 from Baidu Netdisk, but decompressing it with bzip2 -d sgns.merge.word.bz2 fails with:
bzip2: sgns.merge.word.bz2 is not a bzip2 file. Could this be a download problem? Thanks!

Is the target always a word, with only the context varying among word, word+char, word+ngram, and word+char+ngram?

Hello, and thanks for open-sourcing your work. Inspired by your paper, I want to train SGNS embeddings with word+char+ngram contexts on my own corpus. Reading the ngram2vec paper, I found that it distinguishes settings by target and context: uni_uni, uni_bi, bi_bi, and so on. Does CA8 only use uni_bi with a unigram target, with characters then added to the context? If I want to train SGNS embeddings with word+char+ngram contexts, how do I add characters to the context? Do I have to write my own code in the ngram2vec toolkit to add <word, char> pairs?

Need Help...

Could you provide the corpus used to train the word vectors? And how should these vectors be used? Word2Vec.load() raises an error.

How do I use these models?

As an NLP beginner, after reading the README I still do not know how to use these pre-trained files.

Could you provide an explanation of:

  1. What these models are, what format they are in, and how to read them;
  2. Some runnable example code, including loading a model and converting words to vectors;
  3. How to choose among so many models for a given application.

PPMI

I would like to ask how to use the models trained with PPMI. I do not quite understand, and there is little information about them. Could you give some advice? Many thanks.

Vectors for special characters

The embeddings do not seem to contain vectors for special characters such as the space or the carriage return (like the "" entry in Google's vectors). If they are included, which token represents them?

Netdisk links do not exist

Hello, after clicking several of the links, they all report that the netdisk share no longer exists. Could you update the links? Thanks!

Vector files cannot be loaded

I tried the vectors generated from the Baidu Encyclopedia corpus and loaded them with gensim (I used load because I want to do incremental training with gensim), but gensim.models.Word2Vec.load("sgns.baidubaike.bigram-char") raises an error.

Character vectors for the Complete Library in Four Sections (四库全书) not provided?

Hello, the footnote in the README says that character vectors are provided for 四库全书, since most archaic Chinese words are single characters; however, there is no link to the character vectors. Are they not publicly released, or will the README be updated later?

Thanks!

Question about the segmentation dictionary

Hello, and many thanks for the evaluation data and word vectors. Some of these vectors score far better on the evaluation than our self-trained ones, so I would like to use them for semantic similarity applications. The problem: HanLP's default dictionary does not contain some of the words in CA8. I would like to merge them into my existing dictionary by aggregation and deduplication, but the word frequency and POS information are missing. Could you share, via a netdisk, the dictionary used for the Baidu Encyclopedia corpus?

Question about the download links

  1. Could you please publish a link to all of the Baidu Netdisk files? I wish to download all the model files quickly rather than one by one.

  2. Is there any plan to save the model files to other netdisks, for example Google Drive or Dropbox? That would be very convenient for overseas researchers.

Many thanks for your work!

Broken links

Have the trained word vector files been published?
Why does clicking a vector link redirect to the Baidu homepage? Thanks!

Cannot download the pre-trained vector files

I tried to download the context word vectors of Word → Character (1), but I could not because I am unable to register a Baidu account. Could you upload the dataset somewhere else, such as Google Drive or Dropbox? Thanks.

Encoding problem

Many thanks for this excellent work. However, when loading the vectors with gensim I ran into an encoding problem: some characters cannot be decoded as UTF-8. Could you provide binary versions of the vector files to avoid this issue?

Question about incremental training

Hello!
When training with gensim, incremental training can only update the parameters of an existing model; new vocabulary cannot be added, which makes it impossible to fine-tune a pre-trained model for a specific task. Did you run into this problem when using ngram2vec?

Duplicate words in the vector files

For example, the People's Daily file sgns.renmin.bigram contains 355989 word vectors, but after deduplication only 355973 distinct words remain.

Usage

I have read the README several times and still do not understand how this project is meant to be used. Is it just for reading?

Word quality needs improvement

For example, tokens such as Non-pollutionVegetablesProcessingIndustryPark and 20171106001562 are essentially meaningless.
