wordvectors's Introduction

Pre-trained word vectors of 30+ languages

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more importantly, some people are probably searching for pre-trained word vector models for non-English languages. Alas! English has received far more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors. I think it's time to turn our eyes to a multi-language version of this.

Nearing the end of the work, I learned that a similar project named polyglot already exists. I strongly encourage you to check out this great project. How embarrassing! Nevertheless, I decided to open this project. You will see that my work has its own flavor, after all.

Requirements

  • nltk >= 1.11.1
  • regex >= 2016.6.24
  • lxml >= 3.3.3
  • numpy >= 1.11.2
  • konlpy >= 0.4.4 (Only for Korean)
  • mecab (Only for Japanese)
  • pythai >= 0.1.3 (Only for Thai)
  • pyvi >= 0.0.7.2 (Only for Vietnamese)
  • jieba >= 0.38 (Only for Chinese)
  • gensim >= 0.13.1 (for Word2Vec)
  • fastText (for fastText)

Background / References

  • Check this to learn what a word embedding is.
  • Check this to quickly get a picture of Word2vec.
  • Check this to install fastText.
  • Watch this to really understand what's happening under the hood of Word2vec.
  • Go get various English word vectors here if needed.

Work Flow

  • STEP 1. Download the Wikipedia database backup dump for the language you want.
  • STEP 2. Extract the running text to the data/ folder.
  • STEP 3. Run build_corpus.py.
  • STEP 4-1. Run make_wordvector.sh to get Word2Vec word vectors.
  • STEP 4-2. Run fasttext.sh to get fastText word vectors.
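
As a rough illustration of STEP 1, the sketch below builds the download URL for a language's latest article dump, following the public layout of dumps.wikimedia.org. The helper names `dump_url` and `download_dump` and the choice of `data/` as the download target are assumptions made for this example; they are not part of this project's scripts.

```python
import os
import urllib.request

# Public Wikimedia dump layout; "latest" points at the most recent dump.
DUMP_URL = (
    "https://dumps.wikimedia.org/{lang}wiki/latest/"
    "{lang}wiki-latest-pages-articles.xml.bz2"
)


def dump_url(lang):
    """Return the URL of the latest articles dump for a Wikipedia language code."""
    return DUMP_URL.format(lang=lang)


def download_dump(lang, dest_dir="data"):
    """Download the dump into dest_dir (hypothetical location) and return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    url = dump_url(lang)
    dest = os.path.join(dest_dir, os.path.basename(url))
    # Dumps for large languages are many gigabytes, so this can take a long time.
    urllib.request.urlretrieve(url, dest)
    return dest
```

Calling `download_dump("ko")`, for instance, would fetch the Korean dump. The downloaded file is still compressed XML, so the running text has to be extracted (STEP 2) before the corpus is built.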

Pre-trained models

Two types of pre-trained models are provided. In the table below, (w) and (f) denote the Word2Vec and fastText models, respectively.

Language                                       ISO 639-1  Vector Size  Corpus Size  Vocabulary Size
Bengali (w) | Bengali (f)                      bn         300          147M         10059
Catalan (w) | Catalan (f)                      ca         300          967M         50013
Chinese (w) | Chinese (f)                      zh         300          1G           50101
Danish (w) | Danish (f)                        da         300          295M         30134
Dutch (w) | Dutch (f)                          nl         300          1G           50160
Esperanto (w) | Esperanto (f)                  eo         300          1G           50597
Finnish (w) | Finnish (f)                      fi         300          467M         30029
French (w) | French (f)                        fr         300          1G           50130
German (w) | German (f)                        de         300          1G           50006
Hindi (w) | Hindi (f)                          hi         300          323M         30393
Hungarian (w) | Hungarian (f)                  hu         300          692M         40122
Indonesian (w) | Indonesian (f)                id         300          402M         30048
Italian (w) | Italian (f)                      it         300          1G           50031
Japanese (w) | Japanese (f)                    ja         300          1G           50108
Javanese (w) | Javanese (f)                    jv         100          31M          10019
Korean (w) | Korean (f)                        ko         200          339M         30185
Malay (w) | Malay (f)                          ms         100          173M         10010
Norwegian (w) | Norwegian (f)                  no         300          1G           50209
Norwegian Nynorsk (w) | Norwegian Nynorsk (f)  nn         100          114M         10036
Polish (w) | Polish (f)                        pl         300          1G           50035
Portuguese (w) | Portuguese (f)                pt         300          1G           50246
Russian (w) | Russian (f)                      ru         300          1G           50102
Spanish (w) | Spanish (f)                      es         300          1G           50003
Swahili (w) | Swahili (f)                      sw         100          24M          10222
Swedish (w) | Swedish (f)                      sv         300          1G           50052
Tagalog (w) | Tagalog (f)                      tl         100          38M          10068
Thai (w) | Thai (f)                            th         300          696M         30225
Turkish (w) | Turkish (f)                      tr         200          370M         30036
Vietnamese (w) | Vietnamese (f)                vi         100          74M          10087
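
To give a feel for how such vectors can be used, here is a minimal sketch that reads vectors in the plain-text format written by fastText (`.vec`) and by the original word2vec tool: a header line with the word count and dimension, then one word and its values per line. Whether a given download uses this format is an assumption; a model saved in a library-specific binary format would need that library's own loader. The function names and the three toy vectors are invented for the example.

```python
import io
import math


def load_vec(handle):
    """Parse text-format vectors: header 'count dim', then 'word v1 ... vdim'."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy data standing in for a real file; with a real download, pass
# open(path, encoding="utf-8") instead of the StringIO object.
sample = io.StringIO("3 2\nking 1.0 0.0\nqueen 0.9 0.1\napple 0.0 1.0\n")
toy_vectors = load_vec(sample)
nearest = most_similar(toy_vectors, "king", topn=1)
```

A pure-Python loop like this is fine for a quick look at a 10k to 50k word vocabulary; for heavier use, loading the vectors into a numpy array would be the more practical choice.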
