Git Product home page Git Product logo

text_analysis's Introduction

edu.emory.mathcs.nlp.vsm

Vector space models.

Word2Vec

  • Replication of the original c version.

  • Preserves long strings.

  • Words are deliminated by ' ' and '\n'.

  • Multiple training files

  • Preprocess data where each word is delimited by ' ' and sentence is delimited by '\n'.

#Word2Vec.java Usage #####Enviornment Details

  • Make sure not to run with the 32bit JDK, it limits heap size.
  • Other Java implementations of word2vec reccomend around 10g heap size. This implementation needs more than that (details to come).
    To set these options from maven, use:
set MAVENOPTS=-Xmx<NUM_GIGS>g -d64

where NUM_GIGS is the number of gigabytes you want to allocate for the heap.

#####Command Line parameters, (each prepended with a tag '-'), most take additional arguments

  • train: "path to the training file or the directory containig the training files.
  • output: "output file to save the resulting word vectors.
  • ext: "extension of the training files (default: "*").
  • size: "size of word vectors (default: 100).
  • window: "max-window of contextual words (default: 5).
  • sample: "threshold for occurrence of words (default: 1e-3). Those that appear with higher frequency in the training data will be randomly down-sampled.
  • negative: "number of negative examples (default: 5; common values are 3 - 10). If negative = 0, use Hierarchical Softmax instead of Negative Sampling.
  • threads: "number of threads (default: 12).
  • iter: "number of training iterations (default: 5).
  • min-count: "min-count of words (default: 5). This will discard words that appear less than times.
  • alpha: "initial learning rate (default: 0.025 for skip-gram; use 0.05 for CBOW).
  • binary: "If set, save the resulting vectors in binary moded.
  • cbow: "If set, use the continuous bag-of-words model instead of the skip-gram model.

text_analysis's People

Contributors

meera1hahn avatar jdchoi77 avatar reid-kilgore avatar ablodge avatar

Stargazers

 avatar Nipuna H Herath avatar  avatar

Watchers

James Cloos avatar  avatar  avatar  avatar tlee54 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.