Deep Semantic Text Hashing with Weak Supervision (SIGIR'18)

Author: Suthee Chaidaroon

This is a PyTorch implementation of the two models described in Deep Semantic Text Hashing with Weak Supervision.

Requirements

Python 3.6 and PyTorch 0.4.

Datasets

We use four datasets in this paper: 20Newsgroups, DBPedia, YahooAnswers, and AG's news. You can download the original datasets from the links provided in the paper. For your convenience, the preprocessed datasets can be downloaded from here. Each preprocessed dataset stores documents as bag-of-words vectors with BM25 weighting. The k-nearest neighbors of each document in both the train and test collections are also provided.
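As a rough illustration of what "bag-of-words with BM25 weighting" means, the sketch below turns tokenized documents into sparse BM25-weighted vectors. It is not the preprocessing code used for the paper: the function name bm25_weight, the parameters k1 and b (set to common defaults), and the exact idf formula are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_weight(docs, k1=1.2, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document.

    docs is a list of token lists. k1 and b are the usual BM25
    saturation and length-normalization parameters (assumed defaults).
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    avg_len = sum(len(doc) for doc in docs) / max(n_docs, 1)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)

    weighted = []
    for doc, c in zip(docs, counts):
        # Length normalization: longer documents get a larger denominator.
        norm = k1 * (1.0 - b + b * len(doc) / max(avg_len, 1e-12))
        weighted.append({
            term: math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            * tf * (k1 + 1.0) / (tf + norm)
            for term, tf in c.items()
        })
    return weighted


docs = [["hash", "hash", "text"], ["text", "code"], ["hash"]]
weights = bm25_weight(docs)
```

Terms that occur more often in a document, or in fewer documents overall, receive larger weights; terms absent from a document are simply missing from its dict.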

You must create two data folders before training the models. The first is the "data" directory, which stores all bag-of-words datasets. The second is the "bm25" directory, which stores all the k-nearest neighbors data.
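A small sketch for creating the two folders follows. It assumes both folders live in the directory you run the training scripts from, which the text above does not state; the helper name make_data_dirs is made up for this example.

```python
from pathlib import Path


def make_data_dirs(root="."):
    """Create the "data" and "bm25" folders under root if they are missing."""
    root = Path(root)
    paths = [root / "data", root / "bm25"]
    for path in paths:
        # exist_ok avoids an error when the folder is already there.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling make_data_dirs() with no argument creates both folders in the current working directory.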

Run the program

We provide three models in this repo: VDSH[1], NbrReg, and NbrReg+Doc. To train a model, use one of the following commands:

To train NbrReg model:

python train_NbrReg.py -g 0 -b 32 -d ng20 --epoch 30 --batch_size 100

To train NbrReg+Doc model:

python train_NbrRegDoc.py -g 0 -b 32 -d ng20 --epoch 30 --batch_size 100

To train VDSH model:

python train_VDSH.py -g 0 -b 32 -d ng20 --epoch 30 --batch_size 100

Custom datasets

To train our models on a custom dataset, make sure the dataset is in bag-of-words format. You also need to generate the k-nearest neighbors files with topK.py:

To create kNN for a train set:

python topK.py -d your_custom_dataset -g 0 --use_train

To create kNN for a test set:

python topK.py -d your_custom_dataset -g 0
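To make the idea of a k-nearest neighbors file concrete, here is a minimal sketch of a top-k search over sparse bag-of-words vectors. It is not the code in topK.py: it ranks by cosine similarity, which may differ from the scoring the script uses, and the names cosine, top_k_neighbors, and exclude_self are assumptions for this example. It also assumes that neighbors are always drawn from the train collection, with a document excluded from its own neighbor list.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_neighbors(queries, collection, k=10, exclude_self=False):
    """For each query, return the indices of its k most similar documents.

    Set exclude_self=True when queries and collection are the same list,
    so that document i is not reported as its own neighbor.
    """
    result = []
    for i, query in enumerate(queries):
        scored = [
            (cosine(query, doc), j)
            for j, doc in enumerate(collection)
            if not (exclude_self and i == j)
        ]
        # Highest similarity first; ties broken by the lower index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result.append([j for _, j in scored[:k]])
    return result


train = [{"a": 1.0}, {"a": 0.9, "b": 0.1}, {"b": 1.0}, {"b": 0.9, "c": 0.1}]
train_nn = top_k_neighbors(train, train, k=1, exclude_self=True)
test_nn = top_k_neighbors([{"b": 1.0}], train, k=2)
```

This brute-force search compares every query with every document, so it is only practical for small collections.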

Bibtex

@inproceedings{Chaidaroon:2018:DST:3209978.3210090,
 author = {Chaidaroon, Suthee and Ebesu, Travis and Fang, Yi},
 title = {Deep Semantic Text Hashing with Weak Supervision},
 booktitle = {The 41st International ACM SIGIR Conference on Research \& Development in Information Retrieval},
 series = {SIGIR '18},
 year = {2018},
 isbn = {978-1-4503-5657-2},
 location = {Ann Arbor, MI, USA},
 pages = {1109--1112},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/3209978.3210090},
 doi = {10.1145/3209978.3210090},
 acmid = {3210090},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {semantic hashing, variational autoencoder, weak supervision},
} 

References

[1] Chaidaroon, Suthee, and Yi Fang. "Variational deep semantic hashing for text documents." Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017.
