
vgupta123 / p-sif


Source code for our AAAI 2020 paper P-SIF: Document Embeddings using Partition Averaging

Home Page: https://vgupta123.github.io/docs/AAAI-GuptaV.3656.pdf


p-sif's Introduction

P-SIF: Document Embeddings using Partition Averaging

Introduction

  • For text classification and information retrieval tasks, text data has to be represented as a fixed-dimension vector.
  • We propose a simple feature-construction technique named P-SIF: Document Embeddings using Partition Averaging, accepted at AAAI 2020.
  • We demonstrate the method through experiments on multi-class classification on the 20newsGroup dataset, multi-label text classification on the Reuters-21578 dataset, the Semantic Textual Similarity tasks (STS 12-16), and several other classification tasks.

Testing

There are four folders named 20newsGroup, Reuters, STS, and other_datasets, which contain code for multi-class classification on the 20newsGroup dataset, multi-label classification on the Reuters dataset, the Semantic Textual Similarity (STS) task on 27 datasets, and multi-class classification on several datasets such as 20newsGroup, BBC Sports, Amazon, Twitter, Classic, Reuters, and Recipe-L.

20newsGroup

Change directory to 20newsGroup for experimenting on the 20newsGroup dataset and create the train and test tsv files as follows:

$ cd 20newsGroup
$ python create_tsv.py
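
For a rough idea of what create_tsv.py produces, here is a minimal stand-in sketch (not the repo's script: it pulls 20newsGroup through scikit-learn rather than the bundled copy, and the file names and column layout are assumptions):

import csv
from sklearn.datasets import fetch_20newsgroups

for split in ("train", "test"):
    data = fetch_20newsgroups(subset=split, remove=("headers", "footers", "quotes"))
    with open("%s.tsv" % split, "w") as f:
        writer = csv.writer(f, delimiter="\t")
        for label, text in zip(data.target, data.data):
            # one document per row: class name, then the whitespace-normalised text
            writer.writerow([data.target_names[label], " ".join(text.split())])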

Get word vectors for all words in vocabulary:

$ python Word2Vec.py 200
# Word2Vec.py takes word vector dimension as an argument. We took it as 200.
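As a rough sketch of what such a training step looks like with gensim's current API (gensim >= 4 uses vector_size where older releases used size; the toy corpus and hyperparameters below are assumptions, not the script's exact settings):

from gensim.models import Word2Vec

# toy corpus; the real script trains on the 20newsGroup training documents
sentences = [["partition", "averaging", "for", "documents"],
             ["sparse", "document", "vectors"]]
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1, workers=4)
model.wv.save_word2vec_format("word2vec_200.txt")  # one 200-d vector per vocabulary word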

Get Sparse Document Vectors (SCDV) for documents in train and test set and accuracy of prediction on test set:

$ python psif.py 200 40
# psif.py takes the word vector dimension and the number of partitions as arguments. We took the word vector dimension as 200 and the number of partitions as 40.
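
For intuition, here is a minimal sketch of the partition-averaging idea behind psif.py, assuming GMM soft assignments and SIF-style word weights (the function name and arguments are hypothetical; the real script also supports KSVD partitions and feeds the vectors to a classifier):

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import TruncatedSVD

def psif_vectors(docs, wv, word_freq, n_partitions=40, a=1e-3):
    # docs: list of token lists; wv: dict word -> vector; word_freq: word -> unigram prob
    vocab = sorted(wv)
    X = np.array([wv[w] for w in vocab])
    probs = dict(zip(vocab, GaussianMixture(n_components=n_partitions).fit(X).predict_proba(X)))
    weight = {w: a / (a + word_freq[w]) for w in vocab}   # SIF: down-weight frequent words
    D = np.zeros((len(docs), X.shape[1] * n_partitions))
    for i, doc in enumerate(docs):
        for w in doc:
            if w in wv:
                # word-topic vector: the word vector scaled by each partition
                # probability, concatenated over partitions, times the SIF weight
                D[i] += weight[w] * np.outer(probs[w], wv[w]).ravel()
        D[i] /= max(len(doc), 1)
    # remove the corpus-wide common component (first right singular vector)
    u = TruncatedSVD(n_components=1).fit(D).components_[0]
    return D - D.dot(u)[:, None] * u[None, :]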

Reuters

Change directory to Reuters for experimenting on the Reuters-21578 dataset. As the Reuters data is in SGML format, parse it and create a pickle file of the parsed data as follows:

$ python create_data.py
# We don't save train and test files locally. We split data into train and test whenever needed.
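
As a rough illustration of the parsing step (this sketch substitutes BeautifulSoup for the SGML parser credited in the note at the end; tag names follow the Reuters-21578 markup, and the paths are assumptions):

import glob, pickle
from bs4 import BeautifulSoup

docs = []
for path in glob.glob("reuters21578/*.sgm"):
    soup = BeautifulSoup(open(path, "rb").read(), "html.parser")
    for item in soup.find_all("reuters"):
        topics = [d.text for d in item.topics.find_all("d")]  # gold labels
        body = item.body.text if item.body else ""
        # LEWISSPLIT marks the standard train/test split used later
        docs.append({"topics": topics, "text": body, "split": item["lewissplit"]})
pickle.dump(docs, open("reuters.pkl", "wb"))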

Get word vectors for all words in vocabulary:

$ python Word2Vec.py 200
# Word2Vec.py takes word vector dimension as an argument. We took it as 200.

Get Sparse Document Vectors (SCDV) for documents in train and test set:

$ python psif.py 200 40
# psif.py takes the word vector dimension and the number of partitions as arguments. We took the word vector dimension as 200 and the number of partitions as 40.

Get performance metrics on test set:

$ python metrics.py 200 40
# metrics.py takes the word vector dimension and the number of partitions as arguments. We took the word vector dimension as 200 and the number of partitions as 40.
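
As an illustration of how such multi-label metrics are typically computed with scikit-learn (assuming X_train/X_test hold the document vectors and Y_train/Y_test are binary label-indicator matrices; metrics.py's exact metric set may differ):

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, coverage_error, label_ranking_average_precision_score

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
Y_score = clf.decision_function(X_test)  # real-valued scores for the ranking metrics
print("micro-F1:", f1_score(Y_test, Y_pred, average="micro"))
print("macro-F1:", f1_score(Y_test, Y_pred, average="macro"))
print("coverage error:", coverage_error(Y_test, Y_score))
print("label ranking avg. precision:", label_ranking_average_precision_score(Y_test, Y_score))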

STS

Change directory to STS for experimenting on the STS datasets. First download paragram_sl999_small.txt from John Wieting's GitHub and keep it in the STS/data folder; the datasets themselves are inside the SentEval folder. For GMM-based data partitioning, the parameters (cluster count, weighting, etc.) are stored in parameters_gmm.csv. Create a word-topic vector for each word using the word vectors from paragram_sl999_small.txt:

$ python create_word_topic_gmm.py
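
A sketch of this step, assuming scikit-learn's GaussianMixture over the paragram vectors (the cluster count really comes from parameters_gmm.csv; 10 below is a placeholder):

import numpy as np
from sklearn.mixture import GaussianMixture

words, vecs = [], []
for line in open("data/paragram_sl999_small.txt"):
    parts = line.rstrip().split(" ")
    words.append(parts[0])
    vecs.append(np.array(parts[1:], dtype=float))
X = np.vstack(vecs)

P = GaussianMixture(n_components=10).fit(X).predict_proba(X)  # soft cluster memberships
# word-topic vector: each word's vector scaled by its cluster probabilities
word_topic = {w: np.outer(P[i], X[i]).ravel() for i, w in enumerate(words)}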

Get the similarity score for each STS dataset:

$ python psif_main_gmm.py
# Outputs each dataset's similarity score and the corresponding parameters.
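
For reference, the usual STS scoring recipe is cosine similarity between the two sentence vectors of each pair, followed by Pearson correlation against the gold scores (vecs1, vecs2, and gold_scores below are placeholders; the script's exact reporting may differ):

import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

pred = [cosine(a, b) for a, b in zip(vecs1, vecs2)]  # one score per sentence pair
r, _ = pearsonr(pred, gold_scores)
print("Pearson r:", r)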

For KSVD-based data partitioning, the parameters (cluster count, weighting, etc.) are stored in parameters_ksvd.csv. Create a word-topic vector for each word using the word vectors from paragram_sl999_small.txt:

$ python create_word_topic_ksvd.py
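
For intuition, the KSVD step can be approximated with scikit-learn's DictionaryLearning (the repo uses its own ApproximateKSVD; X is the word-vector matrix as in the GMM sketch above, and the sizes are placeholders):

import numpy as np
from sklearn.decomposition import DictionaryLearning

dl = DictionaryLearning(n_components=10, transform_algorithm="omp",
                        transform_n_nonzero_coefs=4)
codes = np.abs(dl.fit_transform(X))                 # sparse codes over the dictionary atoms
codes /= codes.sum(axis=1, keepdims=True) + 1e-12   # normalise into soft partition weights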

Get the similarity score for each STS dataset:

$ python psif_main_ksvd.py
# Outputs each dataset's similarity score and the corresponding parameters.

Other_Datasets

To run P-SIF on the remaining seven datasets, go to the Other_Datasets folder. Inside, each dataset has a folder with the dataset name, and a readme.md for running P-SIF is included in each one. You also have to download the Google News embeddings from here and place them in the Other_Datasets folder.
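
Loading those embeddings with gensim looks like this (assuming the standard GoogleNews-vectors-negative300.bin.gz download):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)
print(wv["document"].shape)  # (300,)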

Requirements

Minimum requirements:

  • Python 2.7+
  • NumPy 1.8+
  • Scikit-learn
  • Pandas
  • Gensim

Recommended Citation

@inproceedings{gupta2020psif,
  title={P-SIF: Document Embeddings using Partition Averaging},
  author={Gupta, Vivek and Saw, Ankit and Nokhiz, Pegah and Netrapalli, Praneeth and Rai, Piyush and Talukdar, Partha},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2020}
}

Note: You need not download the 20newsGroup or Reuters-21578 datasets; all datasets are present in their respective directories. We used an SGML parser for parsing the Reuters-21578 dataset, from here.


p-sif's Issues

File "paragram_sl999_small.txt"

Hi,

The file paragram_sl999_small.txt is not available on John Wieting's GitHub. Could you suggest where I could get it?

Also, if I were to run the STS task on a custom set of documents, what should its format be? I was hoping that going through https://github.com/jwieting/iclr2016 would help me understand it better.

Help would be appreciated.

Thanks

STS on custom datasets

Hi,

I wanted to run P-SIF on some custom text-based datasets that I have. How should I do that? Currently it only works on the SentEval datasets, and only those that have been processed and made available in the form described in parameters_gmm.csv.

So how do I run it on my own text-based datasets?

Thanks

"ValueError: The number of atoms cannot be more than the number of feature"

Hi, I am trying to use the ApproximateKSVD class and fit it with a matrix of shape 5000x2048.

from ksvd import ApproximateKSVD  # import path assumed; the class ships with this repo

aksvd = ApproximateKSVD(n_components=10)
aksvd.fit(X)  # X has shape (5000, 2048)

But unfortunately, I get the following error:
"ValueError: The number of atoms cannot be more than the number of feature"

Anything I might be doing wrong?

Thanks
