GenVector

Introduction

This repo contains the code and the datasets used in the following paper:

Multi-Modal Bayesian Embeddings for Learning Social Knowledge Graphs. Zhilin Yang, Jie Tang, William W. Cohen. IJCAI 2016.

We provide an efficient implementation of GenVector, a multi-modal Bayesian embedding model that learns a shared latent topic space to generate embeddings of two modalities: social network users and knowledge concepts. GenVector combines the advantages of topic models and embeddings, and outperforms state-of-the-art methods on the task of learning social knowledge graphs. Our algorithm is deployed on AMiner.

Datasets

The dataset can be downloaded here (8.3 GB). Please extract the compressed file and put the directory data as an immediate sub-directory of genvector (the current directory).

wget https://static.aminer.org/lab-datasets/genvector/data.tar.gz
tar zxvf data.tar.gz

data contains the following files:

  1. gen_pair.*.out: original data extracted from AMiner. We split the file into 8 separate files, indexed from 0 to 7. Each line is formatted as follows:
<author_id>;<keyword_1>,<word_cnt_1>;<keyword_2>,<word_cnt_2>;...

which means that the author with <author_id> has publications in which <keyword_i> occurs <word_cnt_i> times. This file is a (small) subset of AMiner.

  2. homepage_test.txt: the ground truth file for the homepage matching experiment. Each line is formatted as follows:
<author_id>,<keyword_1>,<keyword_2>,...

The keywords following an author_id are the research interests of the author.

  3. lk_test.txt: the ground truth file for the LinkedIn matching experiment. Each line is formatted as follows:
<author_id>,<keyword_1>,<keyword_2>,...

The keywords following an author_id are the research interests of the author.

  4. sample_id.txt: the file that contains the samples that we use to train our model. Each line is an author_id.

  5. keyword.model*: keyword embeddings trained on AMiner. The embeddings are stored in the gensim format. The following script gives an example of loading and using the embeddings:

import gensim

# load the pretrained keyword model (pre-4.0 gensim API; with gensim >= 4.0 use model.wv)
model = gensim.models.Word2Vec.load('data/keyword.model')
keyword = 'machine_learning'
embedding = model[keyword]  # access the embedding of machine_learning
sim = model.similarity('machine_learning', 'data_mining')  # similarity between the two keywords
  6. online.author_word.model*: author embeddings trained on AMiner, in the same format as the keyword embeddings. Author embeddings can be accessed using author_id.
import gensim

model = gensim.models.Word2Vec.load('data/online.author_word.model')
author_id = '53f42f36dabfaedce54dcd0c'  # Jiawei Han's author_id on AMiner
embedding = model[author_id]  # pre-4.0 gensim API; with gensim >= 4.0 use model.wv[author_id]
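
As a minimal illustration of the gen_pair.*.out format described above, a line can be parsed as follows (the helper name is ours, not part of this repo):

```python
def parse_gen_pair_line(line):
    """Parse one gen_pair.*.out line into (author_id, {keyword: count})."""
    fields = line.strip().split(';')
    author_id = fields[0]
    counts = {}
    for field in fields[1:]:
        if not field:
            continue
        # keywords are underscore-joined, so the last comma separates the count
        keyword, cnt = field.rsplit(',', 1)
        counts[keyword] = int(cnt)
    return author_id, counts

author_id, counts = parse_gen_pair_line(
    '53f42f36dabfaedce54dcd0c;data_mining,12;machine_learning,7')
```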

Preprocessing

If you would like to develop and test your method on our datasets, you can skip this section. However, if you would like to run the GenVector model, you need to follow the instructions in this section to preprocess the data into a specific format.

You can directly download the preprocessed data files that are prepared for our model here (110 MB). Please extract the compressed file and put the directory my_data as an immediate sub-directory of genvector (the current directory).

wget https://static.aminer.org/lab-datasets/genvector/my_data.tar.gz
tar zxvf my_data.tar.gz

Otherwise, you can run the script prep_data.py to do the preprocessing. Note that prep_data.py depends on gensim. Running prep_data.py can take up to 25 GB of memory and 30 minutes, so directly downloading the above files is recommended.

Training

Once you follow the above instructions and have the directories data and my_data in place, you are ready to train the model. Compile and run

make
./main

This will produce a file model.save.txt in the directory my_data, which contains the saved model parameters.

Please refer to the documentation inside model.hpp if you would like to tune the hyper-parameters.

Inference

After the model training is done, you can use the saved model for inference. Make sure you have my_data/model.save.txt in place before you do inference. Compile and run

make
./predict

This will produce a file model.predict.txt in the directory my_data, which contains the prediction results given by the model. The format of each line is as follows:

<author_id>,<keyword_1>,<keyword_2>,...

The keywords are sorted from highest probability to lowest; i.e., our model "thinks" <keyword_1> is a more likely research interest than <keyword_2> for the given author_id.
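
Since homepage_test.txt, lk_test.txt, and model.predict.txt all share this line format, a single loader can read any of them. A sketch (the function name is ours):

```python
def load_keyword_file(lines):
    """Parse <author_id>,<keyword_1>,<keyword_2>,... lines into a dict
    mapping author_id to its ordered keyword list."""
    result = {}
    for line in lines:
        parts = line.strip().split(',')
        if len(parts) >= 2:  # skip blank or keyword-less lines
            result[parts[0]] = parts[1:]
    return result

# e.g. predictions = load_keyword_file(open('my_data/model.predict.txt'))
sample = ['a1,machine_learning,data_mining', '', 'a2,social_networks']
table = load_keyword_file(sample)
```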

Evaluation

Two evaluation scripts are provided in the directory eval. Assuming that my_data/model.predict.txt and the directory data are in place, run the following to compute the scores:

cd eval
python eval_homepage.py
python eval_lk.py
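
The two scripts implement the paper's evaluation. As a rough sketch of the kind of metric involved (our own simplified precision@k, not the scripts' exact computation):

```python
def precision_at_k(predicted, truth, k=5):
    """Fraction of the top-k predicted keywords that appear in the ground truth."""
    top_k = predicted[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for kw in top_k if kw in set(truth))
    return hits / len(top_k)

p = precision_at_k(['machine_learning', 'data_mining', 'databases'],
                   ['data_mining', 'machine_learning'], k=2)  # 1.0
```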

Misc

  1. The author_id's we use in the data are consistent with AMiner. You can access an author's profile on AMiner at https://aminer.org/profile/<author_id>.

  2. The input/output file configuration is done in config.hpp.

  3. Our implementation leverages fastapprox for efficient computation of log, pow, and exp.

  4. Our code needs further documentation and command-line argument support. We will update the repo in the near future. You are also welcome to contribute to the code base :)

  5. Star our repo and/or cite our paper if you find it useful :)
