Git Product home page Git Product logo

word2vec's Introduction

word2vec 一个中文word2vec

This is a word2vec for Chinese douban movie reviews 在豆瓣电影影评上进行word2vec, 一个中文语料word2vec


Install

  • Install gcc by sudo apt-get install gcc before install Anaconda and gensim, otherwise gensim will run in slow verison(70x slower).
  • Install Anaconda
  • Install gensim

Process Corpus

  • Going to src/ cd src
  • Execute process_corpus.py python -i process_corpus.py
    • input: review_douban_movie.tsv is a file in which a douban review lays on one line.

    • output: review_douban_movie.tsv.removeword in which a processed douban review lays on one line. Some dependencies should be installed for those process flow.

    • Tradition2Simple

      • Install opencc sudo apt-get install opencc
    • WordBeark

      • Install jieba pip install jieba
    • RemoveWord

    • Then we maybe cat some files together if we split corpus into several files, using cat file1 file2 > file3 or cat file2 >> file1

Train Word2Vec

  • Going to src/ cd src
  • Execute train_word2vec.py python -i train_word2vec.py -size 300 -window 5 -sg 1 -min_count 100 -negative 10 -iter 25 -workers 12

Evaluate Word2Vec

  • Going to src/ cd src
  • Execute test_word2vec.py python -i test_word2vec.py ../data/xxx.model
  • questions-words-Zh.txt is a Chinese evaluation file translated from English questions-words.txt using Microsoft Bing Translator API. Some examples:
a b c d
写作 阅读
写作 放缓
思考 唱歌
最大 最小
最好 最高
侄子 侄女 王子 公主
侄子 侄女 儿子 女儿
雅典 希腊 曼谷 泰国
雅典 希腊 北京 **

Finetune Training

THIS FUNCTION IS DEPRECATED. It is hard to get word2vec better by Finetune rather than retrain.

  • python -i finetune.py -finetune ../model/review_douban_movie.tsv.removewordsize300window5sg1min_count100negative10iter25.model -train ../data/wv_train.data

Some usages in gensim Word2Vec

  • Get the vector of a word: model.wv["word"], i.e. model.wv["apple"]
  • Get the index of a word: model.wv.vocab["word"].index, i.e. model.wv.vocab["apple"].index
  • Get a word by its index: model.wv.index2word[index], i.e. model.wv.index2word[5585]
  • Find the most similar word of a given word: model.wv.most_similar("word"), i.e. model.wv.most_similar("apple")
  • Evaluate a model on analogies: model.wv.accuracy( 'questions-words.txt')

安装

  • 确保安装gcc,否则根据官网的说法word2vec会慢70倍。使用命令sudo apt-get install gcc
  • 安装Anaconda
  • 安装gensim

处理语料

  • 进入src文件夹
  • 执行process_corpus.py, python -i process_corpus.py
    • 其中输入为review_douban_movie.tsv,每一行有一条豆瓣影评,一共65w行

    • 输出为review_douban_movie.tsv.removeword,每一行都被繁体转简体、分词、去除标点。使用之前需要安装依赖项。

    • 繁体转简体

      • 安装opencc sudo apt-get install opencc
    • 分词

      • 安装jieba pip install jieba
    • 去除标点

    • 如果多个语料文件分开处理,最后我们可以使用Linux的cat将他们连在一起,如 cat file1 file2 > file3cat file2 >> file1

训练Word2Vec

  • 进入src文件夹
  • 执行train_word2vec.py python -i train_word2vec.py -size 300 -window 5 -sg 1 -min_count 100 -negative 10 -iter 25 -workers 12

评估Word2Vec

  • 进入src文件夹
  • 执行test_word2vec.py python -i test_word2vec.py ../data/xxx.model
  • 值得一提的是,在这一步中我贡献了中文word2vec领域第一份评估标准:data/questions-words-Zh.txt,该标准是英文版的analogies任务翻译为中文。是我在Microsoft期间使用Bing Translator API翻译,翻译质量上乘。一些例子:
a b c d
写作 阅读
写作 放缓
思考 唱歌
最大 最小
最好 最高
侄子 侄女 王子 公主
侄子 侄女 儿子 女儿
雅典 希腊 曼谷 泰国
雅典 希腊 北京 **

Finetune训练

THIS FUNCTION IS DEPRECATED. 使用Finetune并不会取得很好的效果,还不如retrain

  • python -i finetune.py -finetune ../model/review_douban_movie.tsv.removewordsize300window5sg1min_count100negative10iter25.model -train ../data/wv_train.data

一些gensim Word2Vec的使用方法

  • 获取词向量: model.wv["word"], i.e. model.wv["刘德华"]
  • 获取词索引index: model.wv.vocab["word"].index, i.e. model.wv.vocab["张艺谋"].index
  • 通过index获取词: model.wv.index2word[index], i.e. model.wv.index2word[5585]
  • 找一个词最相近的词: model.wv.most_similar("word"), i.e. model.wv.most_similar("周星驰")
  • 四词类推评价: model.wv.accuracy( 'questions-words.txt')

word2vec's People

Contributors

sefira avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.