gensim-doc2vec-spark's Introduction

Gensim Doc2Vec on Spark

Overview

Implements a prototype for running gensim Doc2Vec on Spark. Only PV-DBOW with negative sampling is implemented. This work is inspired by:

https://github.com/dirkneumann/deepdist

https://github.com/klb3713/sentence2ve

General Idea

Most ML models in Spark's MLlib only parallelize the training process: model parameters are kept in the driver program and broadcast to workers. This works for models that are small enough to fit in memory on a single node, while the training set can be large and has to be parallelized as RDDs. Doc2Vec models do not fall into this category, because the number of model parameters grows linearly with the number of data points.

The goal of Doc2Vec is to learn a vector representation of each document in the training set. For example, a dataset of 1 million documents with vector size 300 requires 300,000,000 floating-point parameters (a 1,000,000 x 300 array). Fortunately, each data point (a.k.a. sentence or document) only updates its corresponding row in the weights matrix during training. It is therefore possible to parallelize the model by zipping its parameters with the training dataset: each partition only holds the parameters relevant to its own share of the data.
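The arithmetic and the row-wise split described above can be sketched in plain Python. This is only an illustration: the helper names `doc_param_count` and `split_rows`, the 5-way split and the 4-byte float size are assumptions made for the example, not code from this repository.

```python
def doc_param_count(num_docs, vector_size):
    """Number of floats in the document-vector matrix (one row per document)."""
    return num_docs * vector_size


def split_rows(num_docs, num_partitions):
    """Split row indices 0..num_docs-1 into contiguous, near-equal ranges."""
    base, extra = divmod(num_docs, num_partitions)
    ranges, start = [], 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


params = doc_param_count(1_000_000, 300)
bytes_needed = params * 4  # assumption: 4-byte floats

# Each partition only needs the rows that belong to its own documents.
row_ranges = split_rows(1_000_000, 5)
per_partition = [doc_param_count(len(r), 300) for r in row_ranges]
```

With these example numbers, each of the 5 partitions holds one fifth of the rows, so no single node has to store the whole document-vector matrix.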

gensim is used as the basis for this setup; its training routine for sentence vectors is adapted to work on RDDs.

Details

When training Doc2Vec in PV-DBOW mode with negative sampling, three numpy arrays in gensim are of interest; together they fully capture the model state:

  1. model.syn0
  2. model.syn1neg
  3. model.docvecs.doctag_syn0

In our implementation, syn0 and syn1neg are kept centralized, as their size is bounded by the size of the vocabulary. doctag_syn0 is held as an RDD (each partition holds a single numpy array for its share of the documents). The Word2Vec model is broadcast to all partitions.

In each training iteration, the following happens:

  1. On each partition, the Cython- and BLAS-powered procedure train_document_dbow from gensim.models.doc2vec_inner is called; it trains word vectors, document vectors and hidden layer weights jointly
  2. We record the triplet (syn0 deltas, syn1neg deltas, doctag_syn0) and produce a new RDD in which each partition holds a single triplet; this new RDD is cached (as it will be used twice)
  3. We sum all deltas through Spark's RDD.aggregate API (this runs the actual training, as aggregate is a Spark action), then apply the summed deltas to the model object in the driver program
  4. The previously broadcast generation of the model is unpersisted, and the new model is broadcast to all executors
  5. We create the next iteration's inputs from the cached RDD of step 2; training is not re-run because its results are cached. We then invalidate the stale triplet RDD from the previous iteration
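The shape of one iteration can be simulated without Spark. The sketch below is a toy under stated assumptions: plain lists stand in for RDD partitions, `functools.reduce` stands in for RDD.aggregate, and `train_partition` uses a made-up update rule rather than gensim's train_document_dbow. All function names are hypothetical.

```python
from functools import reduce


def train_partition(shared, doc_rows, lr=0.1):
    """Toy local training: returns (delta for shared weights, updated doc rows).

    Each doc row is pulled towards the shared vector, and the opposite
    change is recorded as a delta for the shared vector.
    """
    delta = [0.0] * len(shared)
    new_rows = []
    for row in doc_rows:
        new_row = []
        for j, (s, d) in enumerate(zip(shared, row)):
            grad = s - d
            new_row.append(d + lr * grad)
            delta[j] -= lr * grad
        new_rows.append(new_row)
    return delta, new_rows


def add_vectors(a, b):
    """Element-wise sum of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]


def run_iteration(shared, partitions):
    # Steps 1-2: train each partition locally, keep one (delta, rows) pair each.
    results = [train_partition(shared, rows) for rows in partitions]
    # Step 3: sum the deltas and apply them to the centrally held weights.
    zero = [0.0] * len(shared)
    total = reduce(add_vectors, (delta for delta, _ in results), zero)
    new_shared = add_vectors(shared, total)
    # Steps 4-5: updated weights and doc rows are the next iteration's inputs.
    new_partitions = [rows for _, rows in results]
    return new_shared, new_partitions


shared, partitions = [1.0, 1.0], [[[0.0, 0.0]], [[0.0, 0.0]]]
shared, partitions = run_iteration(shared, partitions)
```

The point of the sketch is the data flow: document rows never leave their partition, while only the deltas for the shared weights are combined centrally.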

By tweaking num_partitions and num_iterations, we can balance the trade-off between accuracy, speed and network overhead.
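To get a feel for the network side of this trade-off, a back-of-the-envelope estimate can help. The sketch below rests on several assumptions that are not stated in this README: deltas are dense arrays of 4-byte floats, every partition sends one syn0 delta and one syn1neg delta per iteration, and broadcast cost and any executor-side combining are ignored. The function name and the example sizes are hypothetical.

```python
def delta_traffic_bytes(vocab_size, vector_size, num_partitions,
                        num_iterations, bytes_per_float=4):
    """Rough upper bound on delta bytes sent to the driver over a full run."""
    # Two dense arrays (syn0 and syn1neg deltas), each vocab_size x vector_size.
    per_partition = 2 * vocab_size * vector_size * bytes_per_float
    return per_partition * num_partitions * num_iterations


# Hypothetical sizes: 50,000-word vocabulary, 100-dimensional vectors.
few_partitions = delta_traffic_bytes(50_000, 100, num_partitions=5, num_iterations=20)
many_partitions = delta_traffic_bytes(50_000, 100, num_partitions=50, num_iterations=20)
```

Under these assumptions the delta traffic grows linearly with both num_partitions and num_iterations, which is why more partitions are not free even though each one trains on less data.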

Test Results

The Cornell Movie Reviews dataset is used to test the approach. The model is trained on 5 partitions for 20 iterations, and we were able to classify movie review labels with about an 11% error rate using only the document vectors, with balanced false positive and false negative rates:

*** Error Rate: 0.107995 ***
*** False Positive Rate: 0.107799 ***
*** False Negative Rate: 0.108191 ***
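For reference, the three figures can be computed from confusion-matrix counts as sketched below. This assumes the standard definitions of the rates; whether movie_review.py computes them exactly this way is not confirmed here. The function name is hypothetical and the counts in the example are made up, not taken from the experiment.

```python
def error_rates(tp, fp, tn, fn):
    """Error rate, false positive rate and false negative rate from counts."""
    total = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / total,          # share of all misclassified items
        "false_positive_rate": fp / (fp + tn),    # negatives labelled positive
        "false_negative_rate": fn / (fn + tp),    # positives labelled negative
    }


# Made-up counts, for illustration only.
rates = error_rates(tp=90, fp=20, tn=80, fn=10)
```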

Running Test

The following shell script downloads the movie review data and uploads it to HDFS:

wget http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
tar xzvf review_polarity.tar.gz
cd txt_sentoken/
cat pos/*.txt > positive.txt
cat neg/*.txt > negative.txt
hadoop fs -mkdir -p /movie_review/positive
hadoop fs -mkdir -p /movie_review/negative
hadoop fs -put positive.txt /movie_review/positive/
hadoop fs -put negative.txt /movie_review/negative/

The following command submits the testing script movie_review.py:

$SPARK_HOME/bin/spark-submit --verbose \
  --master yarn \
  --deploy-mode client \
  --py-files ddoc2vecf.py \
  movie_review.py

Contributors

yiransheng
