ChineseWordVectors

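Collects, organizes, and publishes pretrained Chinese word vectors and character vectors, to help advance Chinese natural language processing together with others who share that goal.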

Usage

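All word/character vectors are stored in bcolz format. If bcolz is not installed yet, install it first with pip install bcolz or conda install bcolz.
Download the zip archive for the vectors you want (e.g. zh.64.zip) and extract it into a matching directory (e.g. zh.64). The extracted directory should contain two subdirectories, embeddings and words, which store the embedding vectors and the vocabulary (words or characters), respectively.
Load the word/character vectors with the Python code below (compatible with Python 2 and 3):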

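import os

import bcolz

def load_embeddings(folder_path):
    """Load word/character vectors from bcolz.

    Args:
        - folder_path (str): the extracted bcolz rootdir (e.g. zh.64),
                             containing two subdirectories, embeddings and words,
                             which store the embedding vectors and the vocabulary

    Returns:
        - words (bcolz.carray): the vocabulary list (a bcolz carray has an
                                interface similar to a numpy array)
        - embeddings (bcolz.carray): the embedding matrix; each row is one
                                     word/character vector, and its row number is
                                     the index of that word (character) in words
    """
    # Open both arrays read-only; os.path.join tolerates a trailing slash.
    words = bcolz.carray(rootdir=os.path.join(folder_path, 'words'), mode='r')
    embeddings = bcolz.carray(rootdir=os.path.join(folder_path, 'embeddings'), mode='r')
    return words, embeddings

# Path to the folder that holds the extracted vectors (e.g. 'zh.64')
folder_path = 'path/to/extracted/vectors/folder'
words, embeddings = load_embeddings(folder_path)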
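
Because a vector's row number is the index of its word in words, a lookup needs a word-to-row mapping. The sketch below is one possible way to build it, not part of this repository: the helper names (build_word_index, get_vector, cosine_similarity) are hypothetical, and plain Python lists stand in for the two bcolz carrays, on the assumption that they can be iterated and indexed by integer like numpy arrays. Depending on how a vocabulary was stored, its entries may be bytes rather than str and would then need decoding first.

```python
import math

def build_word_index(words):
    """Map each vocabulary entry to its row number in the embedding matrix."""
    return {word: row for row, word in enumerate(words)}

def get_vector(word, word_index, embeddings, unk_token=None):
    """Return the vector for `word`; fall back to `unk_token` when one is given."""
    row = word_index.get(word)
    if row is None and unk_token is not None:
        row = word_index.get(unk_token)
    if row is None:
        raise KeyError(word)
    return embeddings[row]

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy stand-ins for the (words, embeddings) pair returned by load_embeddings.
toy_words = ['<UNK>', '中国', '北京']
toy_embeddings = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

toy_index = build_word_index(toy_words)
similarity = cosine_similarity(
    get_vector('中国', toy_index, toy_embeddings),
    get_vector('北京', toy_index, toy_embeddings),
)
```

With the real data, build the index once from words and reuse it; for the larger vocabularies a vectorized numpy computation would be a better fit than these pure-Python loops.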

Download links

Baidu Netdisk (百度网盘)

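Type, dimensions, and vocabulary

Counts are given in units of w (万, i.e. 10,000), so 200w is 2 million.

Name	Type	Dimensions	Vocabulary size	Chinese	English	Digits	Other	Corpus
facebook_cc	word	300	200w	156w	27w	13w	3.3w	Common Crawl, Wikipedia
facebook_wiki	word	300	33w	14w	17w	0	1.6w	Wikipedia
sjl_weixin	word	256	35w	25w	6w	2w	0.3w	8 million WeChat official-account articles, about 65 billion words in total
polyglot_wiki	word	64	10w	6w	3w	0	0.2w	Wikipedia
polyglot_wiki	character	64	10w	1.2w	7.8w	0	0.9w	Wikipedia
novel_baidubaike_news	word	64	611w	611w	0	0	0	Novels, about 90 GB; Baidu Baike, 8 million+ entries, 20 GB+; Sohu News, 4 million+ articles, 12 GB+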

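Simplified/Traditional, full-width/half-width, letter case, and special symbols

Name	Simplified	Traditional	Full-width	Half-width	Uppercase	Lowercase	Digits	Special symbols
facebook_cc	✔	✔	✔	✔	✔	✔	✔	end of sentence </s>
facebook_wiki	✔	✔	✔	✔	✘	✔	✘	end of sentence </s>
sjl_weixin	✔	✔	✔	✔	✔	✔	✔	newline \n
polyglot_wiki	✔	✔	✔	✔	✔	✔	✘	unknown <UNK>, start of sentence <S>, end of sentence </S>, padding <PAD>
polyglot_wiki	✔	✔	✔	✔	✔	✔	✘	unknown <UNK>, start of sentence <S>, end of sentence </S>, padding <PAD>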
novel_baidubaike_news	✔	✘	✘	✘	✘	✘	✘
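
The special symbols listed for polyglot_wiki suggest how a token sequence could be turned into embedding row numbers. The following is a minimal sketch under stated assumptions, not the procedure used to train these vectors: encode_tokens is a hypothetical helper, the toy vocabulary and its row numbers are made up, and wrapping with <S> and </S> plus right-padding with <PAD> is simply one common convention.

```python
def encode_tokens(tokens, word_index, max_len,
                  unk='<UNK>', start='<S>', end='</S>', pad='<PAD>'):
    """Convert tokens to row numbers, adding start/end markers and right padding."""
    if max_len < 2:
        raise ValueError('max_len must leave room for the start and end markers')
    unk_row = word_index[unk]
    # Truncate so that the start and end markers always fit within max_len.
    body = [word_index.get(token, unk_row) for token in tokens[:max_len - 2]]
    rows = [word_index[start]] + body + [word_index[end]]
    # Pad on the right up to the fixed length.
    rows += [word_index[pad]] * (max_len - len(rows))
    return rows

# Toy vocabulary; the row numbers here are illustrative only.
toy_index = {'<UNK>': 0, '<S>': 1, '</S>': 2, '<PAD>': 3, '我': 4, '爱': 5}

# '你' is not in the toy vocabulary, so it maps to the <UNK> row.
encoded = encode_tokens(['我', '爱', '你'], toy_index, max_len=7)
```

For vector sets without an unknown or padding symbol (for example facebook_cc, which lists only </s>), out-of-vocabulary tokens would need a different policy, such as skipping them.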

Authors, corpora, tools (segmentation and training), and training parameters

Name	Author	Segmenter	Training tool	Training parameters
facebook_cc	Facebook	Stanford word segmenter	fastText	CBOW with position-weights, character n-grams of length 5, window of size 5, and 10 negatives
facebook_wiki	Facebook		fastText	Default settings from the paper Enriching Word Vectors with Subword Information
sjl_weixin	Su Jianlin (苏剑林)	Jieba	Gensim	Skip-Gram, Huffman Softmax, window size 10, minimum word frequency 64, 10 iterations
polyglot_wiki	Polyglot		word2embeddings polyglot2
novel_baidubaike_news	dada	Jieba	Gensim	window=5, min_count=5, all other parameters at the Word2Vec defaults
