cw2vec-pytorch's Introduction

cw2vec model by Gensim and PyTorch

This repo contains an implementation of the cw2vec model using PyTorch and gensim. It is based mainly on the model proposed in the paper "cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information".

note: After reading the paper as a whole, the model turns out to be very similar to fastText, so it can be implemented on top of fastText.

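  • version1: implemented on top of the fastText model in the gensim package; training is fast.
  • version2: implemented entirely in the PyTorch framework; training is slow, but the details of the model are easier to follow.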

In addition, for a PyTorch implementation of the word2vec model, see: github
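
What separates cw2vec from fastText is the subword unit: stroke n-grams instead of character n-grams. The following is a minimal sketch of that step, not code from this repo. It assumes the encoding described in the paper (strokes grouped into five classes written as digits 1 to 5, n-gram lengths from 3 to 12). `STROKE_TABLE` and `stroke_ngrams` are hypothetical names, and the table is hand-written for two characters only.

```python
# Hypothetical, hand-written stroke table covering two characters only.
# Assumed digit encoding of the five stroke classes:
# 1 = horizontal, 2 = vertical, 3 = left-falling, 4 = right-falling, 5 = turning.
STROKE_TABLE = {"大": "134", "人": "34"}


def stroke_ngrams(word, table=STROKE_TABLE, min_n=3, max_n=12):
    """Return every stroke n-gram of `word` with min_n <= n <= max_n."""
    # Concatenate the stroke codes of all characters in the word.
    sequence = "".join(table[ch] for ch in word)
    # Slide a window of each length n over the stroke sequence.
    return [
        sequence[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(sequence) - n + 1)
    ]


print(stroke_ngrams("大人"))
# ['134', '343', '434', '1343', '3434', '13434']
```

Each n-gram would then be treated like a fastText subword, which is why an existing fastText implementation can be reused once words are rewritten as stroke sequences.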

structure of the model

The overall model structure is shown in the figure below:

model

The way the training data is constructed is shown in the figure below:

2019-03-05_150834
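
The figure is not reproduced here. As a rough illustration only, and assuming the data is built in the usual skip-gram way (each center word paired with the words inside a context window), the pairing step could look like the sketch below. `skipgram_pairs` and the `window` parameter are hypothetical names, not identifiers from this repo.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every token with its neighbours within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy, already segmented sentence.
print(skipgram_pairs(["我", "爱", "北京"], window=1))
# [('我', '爱'), ('爱', '我'), ('爱', '北京'), ('北京', '爱')]
```

In a cw2vec-style model the center word of each pair would then be represented by its stroke n-grams, while the context word keeps its own embedding.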

Structure of the code

The code directory is organized as follows:

├── pycw2vec
|  └── callback
|  |  └── lrscheduler.py  
|  |  └── trainingmonitor.py 
|  |  └── ...
|  └── config
|  |  └── basic_config.py #a configuration file for storing model parameters
|  └── dataset   
|  └── io    
|  |  └── dataset.py  
|  |  └── data_transformer.py  
|  └── model
|  |  └── nn 
|  |  └── pretrain 
|  └── output #save the output of model
|  └── preprocessing #text preprocessing 
|  └── train #used for training a model
|  |  └── trainer.py 
|  |  └── ...
|  └── utils # a set of utility functions
├── train_cw2vec.py   # PyTorch version
├── get_similar_word.py # compute word similarity
├── train_gensim_cw2vec.py # gensim version

Dependencies

  • csv
  • tqdm
  • numpy
  • pickle
  • gensim
  • scikit-learn
  • PyTorch 1.0
  • matplotlib

How to use the code

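  1. Prepare the training data and place it in the pycw2vec/dataset/raw directory.
  2. Edit the configuration file under pycw2vec/config as needed (for example the data paths and model parameters).
  3. Run python train_gensim_cw2vec.py to train the cw2vec model.
  4. Run python get_similar_word.py to get the most similar words for a query word; the top 10 results are returned by default.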
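
The similarity lookup in step 4 can be pictured as a cosine-similarity ranking over the trained vectors. This is a minimal sketch under that assumption, not the contents of get_similar_word.py. `cosine`, `most_similar`, and the toy two-dimensional vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, topn=10):
    """Rank every word in `vectors` by cosine similarity to `query`."""
    q = vectors[query]
    scored = [(word, cosine(q, vec)) for word, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


# Toy vectors, not trained embeddings.
toy = {"男人": [1.0, 0.0], "男孩": [0.9, 0.1], "女人": [0.5, 0.5]}
for word, score in most_similar("男人", toy, topn=3):
    print(f"{word} : {score:.3f}")
```

As in the results listed in this README, the query word itself comes first with a score of 1.000 because it is not excluded from the ranking.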

result

<<<<<<<<<<<<<<<<<<<<<
**
** : 1.000
**区 : 0.913
我国 : 0.900
大国 : 0.899
**队 : 0.898
美国 : 0.897
韩国 : 0.893
**史 : 0.893
国 : 0.886
齐国 : 0.884
>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<
男人
男人 : 1.000
男孩 : 0.901
男生 : 0.873
男孩子 : 0.867
女人 : 0.866
女孩 : 0.861
男女生 : 0.851
女生 : 0.840
男伴 : 0.833
男女 : 0.827
>>>>>>>>>>>>>>>>>>>>>
<<<<<<<<<<<<<<<<<<<<<
女人
女人 : 1.000
男人 : 0.866
女孩 : 0.802
女色 : 0.797
男孩 : 0.791
女生 : 0.768
男孩子 : 0.759
女孩子 : 0.750
男生 : 0.743
女性 : 0.735

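note: During the experiments we found that, for Chinese strokes, the embedding size has a large effect. If the embedding size is set too large, the whole model becomes biased toward the strokes, which produces many unreasonable results.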

cw2vec-pytorch's People

Contributors

greeksilverfir, lonepatient

cw2vec-pytorch's Issues

Error when running train_cw2vec

RuntimeError: CUDA error: invalid device ordinal
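The error is shown above. I want to run a word_similarity task, so I need a file that maps each word to its word vector, which means running train_cw2vec. It currently fails with this error even though the server does have a GPU. I hope someone can help.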
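
"invalid device ordinal" is the error CUDA typically reports when the requested GPU index is not among the devices visible to the process, for example index 1 on a machine with a single GPU. The sketch below shows one way to validate the index before using it. `resolve_device` is a hypothetical helper, not part of this repo, and the visible GPU count is passed in as a plain integer; in PyTorch it would normally come from `torch.cuda.device_count()`.

```python
def resolve_device(requested_index, visible_gpu_count):
    """Map a configured GPU index to a device string that is safe to use."""
    if visible_gpu_count <= 0:
        return "cpu"  # no GPU visible to this process
    if 0 <= requested_index < visible_gpu_count:
        return f"cuda:{requested_index}"
    return "cuda:0"  # configured index does not exist; fall back to the first GPU


print(resolve_device(1, 1))  # cuda:0
print(resolve_device(0, 0))  # cpu
```

Whether the configured index in this repo's config file is the actual cause is an assumption; the sketch only illustrates the check.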

Training error

I hit this problem both with the provided raw data and with data I segmented myself. What could be the cause of this KeyError?

[training] 136536/498250 [>>>>>>>> ] -0.0s/step loss: 1.4116
[2019-07-08 23:25:35]: cw2vec trainer.py[line:79] INFO saving word2vec vector
save vector: 67%|██████▋ | 8839/13259 [00:00<00:00, 17438.27it/s]
Traceback (most recent call last):
  File "E:/workstation/cw2vec-pytorch/train_cw2vec.py", line 95, in <module>
    main()
  File "E:/workstation/cw2vec-pytorch/train_cw2vec.py", line 76, in main
    trainer.train()
  File "E:\workstation\cw2vec-pytorch\pycw2vec\train\trainer.py", line 145, in train
    self.save()
  File "E:\workstation\cw2vec-pytorch\pycw2vec\train\trainer.py", line 87, in save
    word = id_word[i]
KeyError: 9965
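
The traceback shows only that the save loop asked `id_word` for id 9965 and the mapping has no such key; why the id is missing cannot be determined from the log. As an illustration, a save loop that tolerates gaps in the id-to-word mapping could be sketched as follows. `iter_saved_rows` and `num_rows` are hypothetical names; only `id_word` appears in the traceback.

```python
def iter_saved_rows(id_word, num_rows):
    """Yield (row_index, word) pairs, skipping ids that have no word."""
    for i in range(num_rows):
        word = id_word.get(i)  # .get returns None instead of raising KeyError
        if word is None:
            continue
        yield i, word


# Toy mapping with a gap at id 1.
print(list(iter_saved_rows({0: "我", 2: "北京"}, num_rows=3)))
# [(0, '我'), (2, '北京')]
```

Skipping missing ids avoids the crash, but a gap may also indicate that the vocabulary and the embedding matrix were built inconsistently, which would need to be checked separately.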

Dataset question

I would like to ask: is zhihu.txt already word-segmented data?

A question, please help answer it

Hello, does this output mean that training is already running?

[training] 542615/15000050 [ ] -0.1s/step loss: 1.
[training] 542616/15000050 [ ] -0.1s/step loss: 0.
[training] 542617/15000050 [ ] -0.1s/step loss: 0.
[training] 542618/15000050 [ ] -0.1s/step loss: 1.
[training] 542619/15000050 [ ] -0.1s/step loss: 1.
[training] 542620/15000050 [ ] -0.1s/step loss: 1.
[training] 542621/15000050 [ ] -0.1s/step loss: 1.
[training] 542622/15000050 [ ] -0.1s/step loss: 1.
[training] 542623/15000050 [ ] -0.1s/step loss: 0.
[training] 542624/15000050 [ ] -0.1s/step loss: 0.
[training] 542625/15000050 [ ] -0.1s/step loss: 0.
[training] 542626/15000050 [ ] -0.1s/step loss: 0.
[training] 542627/15000050 [ ] -0.1s/step loss: 0.
[training] 542628/15000050 [ ] -0.1s/step loss: 0.
[training] 542629/15000050 [ ] -0.1s/step loss: 0.
[training] 542630/15000050 [ ] -0.1s/step loss: 0.
[training] 542631/15000050 [ ] -0.1s/step loss: 0.
[training] 542632/15000050 [ ] -0.1s/step loss: 0.
[training] 542633/15000050 [ ] -0.1s/step loss: 0.
[training] 542634/15000050 [ ] -0.1s/step loss: 1.
[training] 542635/15000050 [ ] -0.1s/step loss: 1.
[training] 542636/15000050 [ ] -0.1s/step loss: 0.
[training] 542637/15000050 [ ] -0.1s/step loss: 0.
[training] 542638/15000050 [ ] -0.1s/step loss: 1.
[training] 542639/15000050 [ ] -0.1s/step loss: 1.
[training] 542640/15000050 [ ] -0.1s/step loss: 2.
[training] 542641/15000050 [ ] -0.1s/step loss: 2.
[training] 542642/15000050 [ ] -0.1s/step loss: 1.
[training] 542643/15000050 [ ] -0.1s/step loss: 2.
[training] 542644/15000050 [ ] -0.1s/step loss: 3.
[training] 542645/15000050 [ ] -0.1s/step loss: 1.
[training] 542646/15000050 [ ] -0.1s/step loss: 1.
[training] 542647/15000050 [ ] -0.1s/step loss: 2.
[training] 542648/15000050 [ ] -0.1s/step loss: 1.
[training] 542649/15000050 [ ] -0.1s/step loss: 3.
[training] 542650/15000050 [ ] -0.1s/step loss: 1.
[training] 542651/15000050 [ ] -0.1s/step loss: 1.
[training] 542652/15000050 [ ] -0.1s/step loss: 1.
[training] 542653/15000050 [ ] -0.1s/step loss: 0.
[training] 542654/15000050 [ ] -0.1s/step loss: 3.
[training] 542655/15000050 [ ] -0.1s/step loss: 3.
[training] 542656/15000050 [ ] -0.1s/step loss: 4.
[training] 542657/15000050 [ ] -0.1s/step loss: 1.
[training] 542658/15000050 [ ] -0.1s/step loss: 2.
[training] 542659/15000050 [ ] -0.1s/step loss: 3.
[training] 542660/15000050 [ ] -0.1s/step loss: 4.
[training] 542661/15000050 [ ] -0.1s/step loss: 1.
[training] 542662/15000050 [ ] -0.1s/step loss
