word2vec-scala

This is a Scala implementation of the word2vec toolkit's model representation.

This Scala interface allows the user to access the vector representation output by the word2vec toolkit. It also implements example operations that can be done on the vectors (e.g., word-distance, word-analogy).

Note that it does NOT implement the actual training algorithms. You will still need to download and compile the original word2vec tool if you wish to train new models.

Includes

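The included model (vectors.bin) was trained on the text8 corpus, which contains the first 100 MB of the "clean" English Wikipedia corpus. It was trained with the skip-gram architecture (-cbow 0), 200-dimensional vectors, and hierarchical softmax instead of negative sampling (-hs 1, -negative 0), using the following command: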

./word2vec -train text8 -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1

Usage

Load model

val model = new Word2Vec()
model.load("vectors.bin")
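For readers curious about what loading involves, here is a minimal sketch of a reader for the binary format that the original tool writes with -binary 1: a text header holding the vocabulary size and vector size, then for each word the word itself, a space, the raw 4-byte floats, and a newline. This is not this library's implementation. The object name BinaryModelSketch and its load method are hypothetical, and the sketch assumes the floats were written on a little-endian machine.

```scala
import java.io.{BufferedInputStream, ByteArrayOutputStream, DataInputStream, InputStream}
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical helper, not part of word2vec-scala.
object BinaryModelSketch {

  // Read bytes up to (not including) `delim`, skipping any newline left over
  // from the previous entry. Stops early at end of stream.
  private def readToken(in: InputStream, delim: Char): String = {
    val buf = new ByteArrayOutputStream()
    var b = in.read()
    while (b == '\n') b = in.read()
    while (b != -1 && b != delim) { buf.write(b); b = in.read() }
    new String(buf.toByteArray, "UTF-8")
  }

  // Parse "<numWords> <vecSize>\n" followed by numWords entries of
  // "<word> <vecSize raw floats>\n".
  def load(in: InputStream): Map[String, Array[Float]] = {
    val data = new DataInputStream(new BufferedInputStream(in))
    val numWords = readToken(data, ' ').toInt
    val vecSize = readToken(data, '\n').toInt
    (0 until numWords).map { _ =>
      val word = readToken(data, ' ')
      val raw = new Array[Byte](vecSize * 4)
      data.readFully(raw)
      val vec = new Array[Float](vecSize)
      // Assumption: little-endian floats, as written on x86 machines.
      ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vec)
      word -> vec
    }.toMap
  }
}
```

The sketch returns a plain Map from word to vector. It reads each vector as a fixed number of bytes, so a byte inside a float that happens to equal a newline or space is not mistaken for a delimiter.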

Distance - Find N best matches

val results = model.distance(List("france"), N = 10)
model.pprint(results)
                                              Word       Cosine distance
------------------------------------------------------------------------
                                           belgium              0.706633
                                             spain              0.672767
                                       netherlands              0.668178
                                             italy              0.616545
                                       switzerland              0.595572
                                        luxembourg              0.591839
                                          portugal              0.564891
                                           germany              0.549196
                                            russia              0.543569
                                           hungary              0.519036

Query with several words:

model.pprint( model.distance(List("france", "usa")) )
                                              Word       Cosine distance
------------------------------------------------------------------------
                                       netherlands              0.691459
                                       switzerland              0.672526
                                           belgium              0.656425
                                            canada              0.641793
                                            russia              0.612469
                                                 .              .
                                                 .              .
                                                 .              .
                                           croatia              0.451900
                                            vantaa              0.450767
                                            roissy              0.448256
                                            norway              0.447392
                                              cuba              0.446168

Query with a repeated word:

model.pprint( model.distance(List("france", "usa", "usa")) )
                                              Word       Cosine distance
------------------------------------------------------------------------
                                            canada              0.631119
                                       switzerland              0.626366
                                       netherlands              0.621275
                                            russia              0.569951
                                           belgium              0.560368
                                                 .              .
                                                 .              .
                                                 .              .
                                             osaka              0.418143
                                               eas              0.417097
                                           antholz              0.415458
                                           fukuoka              0.414105
                                           zealand              0.413075
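
To make the scores above concrete, here is a minimal sketch of a nearest-neighbor query by cosine similarity over a word-to-vector map. It is an illustration, not the library's code. The names DistanceSketch, cosine, sum, and nearest are hypothetical, and combining several query words by summing their vectors is an assumption.

```scala
// Hypothetical helper, not part of word2vec-scala.
object DistanceSketch {
  type Vec = Array[Float]

  // Euclidean length of a vector.
  def norm(v: Vec): Double = math.sqrt(v.map(x => x.toDouble * x).sum)

  // Cosine similarity: dot product divided by the product of the lengths.
  def cosine(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
    dot / (norm(a) * norm(b))
  }

  // Assumption: several query words are combined by element-wise summation.
  def sum(vs: Seq[Vec]): Vec =
    vs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  // Top-n words by cosine similarity to the combined query,
  // excluding the query words themselves.
  def nearest(vocab: Map[String, Vec], query: List[String], n: Int): List[(String, Double)] = {
    val q = sum(query.map(vocab))
    vocab.toList
      .filterNot { case (w, _) => query.contains(w) }
      .map { case (w, v) => (w, cosine(q, v)) }
      .sortBy(-_._2)
      .take(n)
  }
}
```

Under this scheme, listing a word twice doubles its contribution to the query vector, which would be consistent with the way the third example above shifts toward neighbors of "usa".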

Analogy - King is to Queen, as Man is to ???

model.pprint( model.analogy("king", "queen", "man", N = 10) )
                                              Word       Cosine distance
------------------------------------------------------------------------
                                             woman              0.547376
                                              girl              0.509787
                                              baby              0.473137
                                            spider              0.450589
                                              love              0.433065
                                        prostitute              0.433034
                                             loves              0.422127
                                            beauty              0.421060
                                             bride              0.413417
                                              lady              0.406856
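
The analogy query is commonly described as vector arithmetic: for "king is to queen as man is to ?", take queen - king + man and look for the words closest to the result. The following is a minimal sketch under that assumption. It is not the library's code, and the names AnalogySketch and analogy are hypothetical.

```scala
// Hypothetical helper, not part of word2vec-scala.
object AnalogySketch {
  type Vec = Array[Float]

  // Cosine similarity between two vectors of equal length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
    val na = math.sqrt(a.map(x => x.toDouble * x).sum)
    val nb = math.sqrt(b.map(x => x.toDouble * x).sum)
    dot / (na * nb)
  }

  // "a is to b as c is to ?": build target = b - a + c, then return the n
  // words most similar to the target, excluding the three input words.
  def analogy(vocab: Map[String, Vec], a: String, b: String, c: String, n: Int): List[(String, Double)] = {
    val (va, vb, vc) = (vocab(a), vocab(b), vocab(c))
    val target: Vec = vb.indices.map(i => vb(i) - va(i) + vc(i)).toArray
    val inputs = Set(a, b, c)
    vocab.toList
      .filterNot { case (w, _) => inputs.contains(w) }
      .map { case (w, v) => (w, cosine(target, v)) }
      .sortBy(-_._2)
      .take(n)
  }
}
```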

Ranking - Rank a set of words by their distance to a search term

model.pprint( model.rank("apple", Set("orange", "soda", "lettuce")) )
                                              Word       Cosine distance
------------------------------------------------------------------------
                                            orange              0.203808
                                           lettuce              0.132007
                                              soda              0.075649
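
Ranking can be pictured as scoring each candidate against the search term and sorting by score. The following is a minimal sketch of that idea, not the library's code. The names RankSketch and rank are hypothetical, and the sketch assumes every word is present in the vocabulary.

```scala
// Hypothetical helper, not part of word2vec-scala.
object RankSketch {
  type Vec = Array[Float]

  // Cosine similarity between two vectors of equal length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
    val na = math.sqrt(a.map(x => x.toDouble * x).sum)
    val nb = math.sqrt(b.map(x => x.toDouble * x).sum)
    dot / (na * nb)
  }

  // Score each candidate against the search term, highest similarity first.
  // Assumption: all words exist in vocab; a missing word throws.
  def rank(vocab: Map[String, Vec], word: String, candidates: Set[String]): List[(String, Double)] = {
    val target = vocab(word)
    candidates.toList
      .map(w => (w, cosine(target, vocab(w))))
      .sortBy(-_._2)
  }
}
```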

Compatibility

  • [09/2013] The code was tested with models trained using revision r33 of the word2vec toolkit. It should also work with later revisions, provided the output format does not change.

Contributors

trananh
