
Chinese Word Vectors 中文词向量


This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

Reference

Please cite the following paper if you use these embeddings or the CA8 dataset.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.

@InProceedings{P18-2023,
  author =  "Li, Shen
    and Zhao, Zhe
    and Hu, Renfen
    and Li, Wensi
    and Liu, Tao
    and Du, Xiaoyong",
  title =   "Analogical Reasoning on Chinese Morphological and Semantic Relations",
  booktitle =   "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  year =  "2018",
  publisher =   "Association for Computational Linguistics",
  pages =   "138--143",
  location =  "Melbourne, Australia",
  url =   "http://aclweb.org/anthology/P18-2023"
}

 

A detailed analysis of the relation between the intrinsic and extrinsic evaluations of Chinese word embeddings is presented in the following paper:

Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang. Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings. Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. (CCL & NLP-NABD 2018 Best Paper)

@incollection{qiu2018revisiting,
  title={Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings},
  author={Qiu, Yuanyuan and Li, Hongzheng and Li, Shen and Jiang, Yingdi and Hu, Renfen and Yang, Lijiao},
  booktitle={Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
  pages={209--221},
  year={2018},
  publisher={Springer}
}

Format

The pre-trained vector files are in text format. Each line contains a word and its vector, with values separated by spaces. The first line records the meta information: the first number is the number of words in the file and the second is the dimension size.

Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They are in the same format as liblinear, where the number before ":" denotes the dimension index and the number after it denotes the value.
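For illustration, the following minimal Python sketch reads the dense format and parses a single sparse (liblinear-style) line. It is not part of the released toolkit; the file name is a placeholder, and errors="ignore" is only a pragmatic workaround for occasional non-UTF-8 bytes.

    # Minimal sketch, not part of the released toolkit.
    def load_dense(path):
        """Read a dense vector file: '<vocab_size> <dim>' header, then '<word> v1 ... vdim'."""
        vectors = {}
        with open(path, encoding="utf-8", errors="ignore") as f:
            vocab_size, dim = map(int, f.readline().split())
            for line in f:
                parts = line.rstrip().split(" ")
                if len(parts) != dim + 1:   # skip malformed lines
                    continue
                vectors[parts[0]] = [float(v) for v in parts[1:]]
        return vectors, dim

    def parse_sparse_line(line):
        """Parse one sparse (PPMI) entry: '<word> index:value index:value ...'."""
        parts = line.rstrip().split(" ")
        word, feats = parts[0], parts[1:]
        return word, {int(i): float(v) for i, v in (feat.split(":") for feat in feats)}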

Pre-trained Chinese Word Vectors

Basic Settings

                                       
| Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word | Iteration | Negative Sampling* |
| --- | --- | --- | --- | --- | --- |
| 5 | Yes | 1e-5 | 10 | 5 | 5 |

*Only for SGNS.

Various Domains

Chinese Word Vectors trained with different representations, context features, and corpora.

Word2vec / Skip-Gram with Negative Sampling (SGNS)
| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d / PWD: 5555 |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d / PWD: z5b4 | 300d | 300d / PWD: yenb |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 (Baidu Netdisk / Google Drive) | 300d / 300d | 300d / 300d | 300d / 300d | 300d / 300d |
Positive Pointwise Mutual Information (PPMI)
| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | Sparse | Sparse | Sparse | Sparse |
| Wikipedia_zh 中文维基百科 | Sparse | Sparse | Sparse | Sparse |
| People's Daily News 人民日报 | Sparse | Sparse | Sparse | Sparse |
| Sogou News 搜狗新闻 | Sparse | Sparse | Sparse | Sparse |
| Financial News 金融新闻 | Sparse | Sparse | Sparse | Sparse |
| Zhihu_QA 知乎问答 | Sparse | Sparse | Sparse | Sparse |
| Weibo 微博 | Sparse | Sparse | Sparse | Sparse |
| Literature 文学作品 | Sparse | Sparse | Sparse | Sparse |
| Complete Library in Four Sections 四库全书* | Sparse | Sparse | NAN | NAN |
| Mixed-large 综合 | Sparse | Sparse | Sparse | Sparse |

*Character embeddings are provided, since most Hanzi were themselves words in archaic Chinese.

Various Co-occurrence Information

We release word vectors trained on different co-occurrence statistics. Target and context vectors are often called input and output vectors in related papers.

In this part, one can obtain vectors of arbitrary linguistic units beyond words. For example, character vectors are found in the context vectors of word → character.

All vectors are trained by SGNS on Baidu Encyclopedia.

                                                       
| Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
| --- | --- | --- | --- |
| Word | Word → Word | 300d | 300d |
| Ngram | Word → Ngram (1-2) | 300d | 300d |
| | Word → Ngram (1-3) | 300d | 300d |
| | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
| Character | Word → Character (1) | 300d | 300d |
| | Word → Character (1-2) | 300d | 300d |
| | Word → Character (1-4) | 300d | 300d |
| Radical | Radical | 300d | 300d |
| Position | Word → Word (left/right) | 300d | 300d |
| | Word → Word (distance) | 300d | 300d |
| Global | Word → Text | 300d | 300d |
| Syntactic Feature | Word → POS | 300d | 300d |
| | Word → Dependency | 300d | 300d |

Representations

Existing word representation methods fall into one of two classes: dense and sparse representations. The SGNS model (a model in the word2vec toolkit) and the PPMI model are typical methods of these two classes, respectively. The SGNS model trains low-dimensional real-valued (dense) vectors through a shallow neural network and is also known as a neural embedding method. The PPMI model is a sparse bag-of-features representation weighted by the positive pointwise mutual information (PPMI) weighting scheme.
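To make the PPMI weighting concrete, here is a small NumPy sketch that turns a word-context co-occurrence count matrix into a PPMI matrix. This is a generic illustration of the weighting scheme, not the code used to build the released sparse vectors.

    import numpy as np

    def ppmi(counts):
        """counts: (n_words, n_contexts) matrix of co-occurrence counts."""
        total = counts.sum()
        p_wc = counts / total                             # joint P(w, c)
        p_w = counts.sum(axis=1, keepdims=True) / total   # marginal P(w)
        p_c = counts.sum(axis=0, keepdims=True) / total   # marginal P(c)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
        pmi[~np.isfinite(pmi)] = 0.0                      # zero counts contribute nothing
        return np.maximum(pmi, 0.0)                       # keep only positive PMI

    counts = np.array([[10.0, 0.0, 2.0],
                       [3.0, 5.0, 0.0]])
    print(ppmi(counts))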

Context Features

Three context features are commonly used in the word embedding literature: word, ngram, and character. Most word representation methods essentially exploit word-word co-occurrence statistics, that is, they use the word as the context feature (word feature). Inspired by the language modeling problem, we introduce the ngram feature into the context, so both word-word and word-ngram co-occurrence statistics are used for training (ngram feature). For Chinese, characters (Hanzi) often convey strong semantics, so we also use word-word and word-character co-occurrence statistics for learning word vectors. The length of character-level ngrams ranges from 1 to 4 (character feature).
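As an illustration of the character feature, the snippet below lists the character-level ngrams (lengths 1 to 4) of a segmented word. How these ngrams are paired with target words inside the training toolkit is not shown here; treat this as a sketch of the feature itself.

    def char_ngrams(word, min_n=1, max_n=4):
        """Character-level ngrams of a word, with lengths from min_n to max_n."""
        grams = []
        for n in range(min_n, max_n + 1):
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
        return grams

    print(char_ngrams("自然语言"))
    # ['自', '然', '语', '言', '自然', '然语', '语言', '自然语', '然语言', '自然语言']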

Besides word, ngram, and character, there are other features that have a substantial influence on the properties of word vectors. For example, using the entire text as the context feature can introduce more topic information into word vectors, while using dependency parses as the context feature can add syntactic constraints. 17 co-occurrence types are considered in this project.

Corpus

We made great efforts to collect corpora across various domains. All text data are preprocessed by removing HTML and XML tags; only the plain text is kept, and HanLP (v1.5.3) is used for word segmentation. In addition, traditional Chinese characters are converted into simplified characters with Open Chinese Convert (OpenCC); a small conversion sketch follows the corpus table below. The detailed corpus information is listed as follows:

| Corpus | Size | Tokens | Vocabulary Size | Description |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017), http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou Labs, http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data provided by the NLPIR Lab, http://www.nlpir.org/wordpress/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literary works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | Large corpus built by merging the above corpora |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China |

All words are kept, including low-frequency words.
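A minimal sketch of the traditional-to-simplified conversion step mentioned above, assuming the opencc-python-reimplemented package; the project itself may have used the OpenCC command-line tool or other bindings.

    from opencc import OpenCC   # pip install opencc-python-reimplemented (an assumption)

    cc = OpenCC("t2s")          # traditional Chinese -> simplified Chinese
    print(cc.convert("自然語言處理"))   # prints: 自然语言处理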

Toolkits

All word vectors are trained with the ngram2vec toolkit, a superset of the word2vec and fasttext toolkits that supports arbitrary context features and models.

Chinese Word Analogy Benchmarks

The quality of word vectors is often evaluated with analogy question tasks. In this project, two benchmarks are used for evaluation. The first is CA-translated, where most analogy questions are directly translated from English benchmarks. Although CA-translated has been widely used in many Chinese word embedding papers, it only contains three types of semantic questions and covers 134 Chinese words. In contrast, CA8 is specifically designed for the Chinese language. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations. CA-translated, CA8, and their detailed descriptions are provided in the testsets folder.

Evaluation Toolkit

We present an evaluation toolkit in the evaluation folder.

Run the following commands to evaluate dense vectors:

$ python ana_eval_dense.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_dense.py -v <vector.txt> -a CA8/semantic.txt

Run the following commands to evaluate sparse vectors:

$ python ana_eval_sparse.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_sparse.py -v <vector.txt> -a CA8/semantic.txt
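The dense files can also be loaded directly with gensim and queried with the standard vector-offset (3CosAdd) analogy method. This is a usage sketch with a placeholder file name, not part of the bundled evaluation scripts, and unicode_errors="ignore" is only a workaround for occasional non-UTF-8 bytes.

    from gensim.models import KeyedVectors

    # Load a released dense file (word2vec text format); the path is a placeholder.
    wv = KeyedVectors.load_word2vec_format("sgns.merge.word", binary=False,
                                           unicode_errors="ignore")

    # Vector-offset analogy: vec("国王") - vec("男人") + vec("女人") ≈ vec(?)
    print(wv.most_similar(positive=["国王", "女人"], negative=["男人"], topn=5))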


chinese-word-vectors's Issues

Baidu Encyclopedia word embedding file has some missing spaces

Affected file:
Word2vec / Skip-Gram with Negative Sampling (SGNS)

Baidu Encyclopedia 百度百科, Word

The affected lines are:
line 269598
line 334166
line 340101
line 386099
line 387913
line 398991
line 403440
line 417792
line 440725
line 510420
line 518270
line 628803

On each of these lines, the word and the first number are run together with no space between them.
Please check.

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: invalid continuation byte

I am using the sgns.merge.word vectors with Python 3.
I tried the following two approaches and neither works.

    f = open(filename,'r')
    line = f.readline().strip()
    word_dim = int(line.split(' ')[1])
    for line in f:
        row = line.strip().split(' ')
        vocab.append(row[0])
        embd.append(row[1:])
    f.close()

Using f.readlines() gives the same error.
Error:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 3472-3473: invalid continuation byte

    with open(filename, 'rb') as f:
        line = f.readline().decode('utf-8').strip()
        word_dim = int(line.split(' ')[1])
        for line in f:
            row = line.decode('utf-8').strip().split(' ')
            vocab.append(row[0])
            embd.append(row[1:])

Error:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: invalid continuation byte

About corpus preprocessing

Thanks to the authors for providing the word vectors. Before training word2vec, which HanLP segmentation algorithm did you use, HMM or CRF, and does the choice have a large effect on the trained vectors? Also, did you use a user-defined dictionary during segmentation? And after segmentation, did you filter out stop words and punctuation? Many thanks.

can't use word2vec.load()?

I downloaded the vectors (People's Daily News 人民日报, Word + Character + Ngram) and used bzip2 to decompress the file.
I wanted to use word2vec.load(), so I renamed the file sgns.renmin.bigram-char to sgns_renmin_bigram_char.
The error is:
ValueError: could not broadcast input array from shape (299) into shape (300)

Vocabulary size does not match the number of words in the embedding

I downloaded the word+char model trained on the Wikipedia corpus. After loading it with gensim, the reported vocabulary size is 352281, but the vocabulary size in the README is 2129K.
As I understand it, the two should be the same. Is my understanding wrong, or is the data wrong?

Python gensim cannot load the vector file

D:\Program\Anaconda3\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
File ".\zzk_word2vec.py", line 101, in
test_word_embedding('D:\data\pretrain_word2vec\Chinese-Word-Vectors\sgns.zhihu.char\sgns.zhihu.char')
File ".\zzk_word2vec.py", line 76, in test_word_embedding
model = gensim.models.KeyedVectors.load_word2vec_format(vector_file, binary=False, encoding='utf8')
File "D:\Program\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 250, in load_word2vec_format
parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
File "D:\Program\Anaconda3\lib\site-packages\gensim\utils.py", line 242, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: invalid continuation byte

Question about Word+Character+Ngram

I have a question.
What is the difference between the Word, Word+Ngram, Word+Character, and Word+Character+Ngram context features?
Word vectors are the hidden-layer weight matrix obtained by training on word pairs inside the context window.
How is Word+Ngram trained? If the ngram model is essentially a probability table, how is it folded into word vector training? I have the same question about +Character.
I do not understand the differences between these settings; could you explain? Thanks.

Downloaded vectors cannot be decompressed

I downloaded sgns.merge.word.bz2 from Baidu Netdisk, but decompressing it with bzip2 -d sgns.merge.word.bz2 fails with:
bzip2: sgns.merge.word.bz2 is not a bzip2 file. Could this be a download problem? Thanks!

Is the target always a word, with only the context varying among word, word+char, word+ngram, and word+char+ngram?

Hello, and thanks for open-sourcing your work. Inspired by your paper, I want to train SGNS embeddings with word+char+ngram contexts on my own corpus. Reading the ngram2vec paper, I found that it distinguishes settings by target and context: uni_uni, uni_bi, bi_bi, and so on. Does CA8 only use uni_bi with a unigram target, with characters then added to the context? If I want to train SGNS embeddings with word+char+ngram contexts, how do I add characters to the context? Do I have to write my own code in the ngram2vec toolkit to add <word, char> pairs?

Need Help...

Could you provide the corpus used to train the word vectors? And how should these vectors be used? Word2Vec.load() raises an error.

How do I use these models?

As an NLP beginner, after reading the README I still do not know how to use these pre-trained files.

Could you provide an explanation of:

  1. What these models are, what format they are in, and how to read them;
  2. Some runnable example code, including loading a model and converting words to vectors;
  3. How to choose among so many models for a given application.

PPMI

I would like to ask how to use the models trained with PPMI. I do not quite understand, and there is little information about them. Could you give some advice? Many thanks.

Vectors for special characters

The embeddings do not seem to contain vectors for special characters such as the space or the carriage return (like the "" entry in Google's vectors). If they are included, which token represents them?

Netdisk links do not exist

Hello, after clicking several of the links, they all report that the netdisk share no longer exists. Could you update the links? Thanks!

Vector files cannot be loaded

I tried the vectors generated from the Baidu Encyclopedia corpus and loaded them with gensim (I used load because I want to do incremental training with gensim), but gensim.models.Word2Vec.load("sgns.baidubaike.bigram-char") raises an error.

Character vectors for the Complete Library in Four Sections (四库全书) not provided?

Hello, the footnote in the README says that character vectors are provided for 四库全书, since most archaic Chinese words are single characters; however, there is no link to the character vectors. Are they not publicly released, or will the README be updated later?

Thanks!

Question about the segmentation dictionary

Hello, and many thanks for the evaluation data and word vectors. Some of these vectors score far better on the evaluation than our self-trained ones, so I would like to use them for semantic similarity applications. The problem: HanLP's default dictionary does not contain some of the words in CA8. I would like to merge them into my existing dictionary by aggregation and deduplication, but the word frequency and POS information are missing. Could you share, via a netdisk, the dictionary used for the Baidu Encyclopedia corpus?

Question about the download links

  1. Could you please publish a link to all of the Baidu Netdisk files? I wish to download all the model files quickly rather than one by one.

  2. Is there any plan to save the model files to other netdisks, for example Google Drive or Dropbox? That would be very convenient for overseas researchers.

Many thanks for your work!

Broken links

Have the trained word vector files been published?
Why does clicking a vector link redirect to the Baidu homepage? Thanks!

Cannot download the pre-trained vector files

I tried to download the context word vectors of Word → Character (1), but I could not because I am unable to register a Baidu account. Could you upload the dataset somewhere else, such as Google Drive or Dropbox? Thanks.

Encoding problem

Many thanks for this excellent work. However, when loading the vectors with gensim I ran into an encoding problem: some characters cannot be decoded as UTF-8. Could you provide binary versions of the vector files to avoid this issue?

Question about incremental training

Hello!
When training with gensim, incremental training can only update the parameters of an existing model; new vocabulary cannot be added, which makes it impossible to fine-tune a pre-trained model for a specific task. Did you run into this problem when using ngram2vec?

Duplicate words in the vector files

For example, the People's Daily file sgns.renmin.bigram contains 355989 word vectors, but after deduplication only 355973 distinct words remain.

Usage

I have read the README several times and still do not understand how this project is meant to be used. Is it just for reading?

Word quality needs improvement

For example, tokens such as Non-pollutionVegetablesProcessingIndustryPark and 20171106001562 are essentially meaningless.
