# hyperwords: Hyperparameter-Enabled Word Representations #

hyperwords is a collection of scripts and programs for creating word representations, designed to facilitate academic
research and prototyping of word representations. It allows you to tune many hyperparameters that are pre-set or
ignored in other word representation packages.

hyperwords is free and open software. If you use hyperwords in a scientific publication, we would appreciate a citation:  
"Improving Distributional Similarity with Lessons Learned from Word Embeddings"
Omer Levy, Yoav Goldberg, and Ido Dagan. TACL 2015.


## Requirements ##
Running hyperwords may require a lot of computational resources:  
- disk space for independently pre-processing the corpus  
- internal memory for loading sparse matrices  
- significant running time; hyperwords is neither optimized nor multi-threaded

hyperwords assumes a *nix shell and requires Python 2.7 (Python 3 is not supported) with the following packages installed:
numpy, scipy, sparsesvd.


## Quick-Start ##
1. Download the latest version from BitBucket, unzip it, and make sure all scripts are executable (chmod 755 *.sh).
2. Download a text corpus of your choice.
3. To create word vectors...
    * ...with SVD over PPMI, use: *corpus2svd.sh*
    * ...with SGNS (skip-grams with negative sampling), use: *corpus2sgns.sh*
4. The vectors should be available in textual format under <output_path>/vectors.txt
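
Once the vectors exist in textual format, they can be inspected with a few lines of Python. The sketch below is only illustrative: it assumes each line holds a word followed by its space-separated vector components, which this README does not specify, and `read_text_vectors` is a hypothetical helper, not part of hyperwords.

```python
import io


def read_text_vectors(lines):
    """Parse lines of the ASSUMED form 'word v1 v2 ... vn' into a dict.

    The line layout is an assumption of this sketch; check the actual
    vectors.txt before relying on it (a header line, if present, would
    need to be skipped separately).
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# An in-memory stand-in for a vectors file, for illustration only.
sample = io.StringIO("cat 0.1 0.2\ndog 0.3 0.4\n")
vectors = read_text_vectors(sample)
```

To read a real file, pass an open file object instead of the `io.StringIO` stand-in.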

To explore the list of hyperparameters, use the *-h* or *--help* option.


## Pipeline ##
The following figure shows the hyperwords pipeline:

**DATA:**  raw corpus  =>  corpus  =>  pairs  =>  counts  =>  vocab  
**TRADITIONAL:**  counts + vocab  =>  pmi  =>  svd  
**EMBEDDINGS:**  pairs  + vocab  =>  sgns  

**raw corpus  =>  corpus**  
- *scripts/clean_corpus.sh*
- Eliminates non-alphanumeric tokens from the original corpus.

**corpus  =>  pairs**  
- *corpus2pairs.py*  
- Extracts a collection of word-context pairs from the corpus.
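
As a rough illustration of what a word-context pair is, the sketch below pairs every token with its neighbours inside a fixed symmetric window. The function name `extract_pairs` and the fixed-window behaviour are assumptions made for this example; they do not describe the actual options of *corpus2pairs.py*.

```python
def extract_pairs(tokens, window=2):
    """Yield (word, context) pairs for each token and its neighbours
    within `window` positions on either side (hypothetical helper)."""
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                yield word, tokens[j]


pairs = list(extract_pairs("the quick brown fox".split(), window=1))
```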

**pairs  =>  counts**  
- *scripts/pairs2counts.sh*
- Aggregates identical word-context pairs.
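
Aggregation simply means counting how often each distinct pair occurs. The following Python sketch shows the idea on a toy list; it is an in-memory stand-in for illustration, not the shell script's implementation, and `count_pairs` is a hypothetical name.

```python
from collections import Counter


def count_pairs(pairs):
    """Map each distinct (word, context) pair to its number of occurrences."""
    return Counter(pairs)


# Toy pairs, for illustration only.
toy_pairs = [("quick", "the"), ("quick", "brown"), ("quick", "the")]
counts = count_pairs(toy_pairs)
```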

**counts  =>  vocab**  
- *counts2vocab.py*  
- Creates vocabularies with the words' and contexts' unigram distributions.

**counts + vocab  =>  pmi**  
- *counts2pmi.py*  
- Creates a PMI matrix (*scipy.sparse.csr_matrix*) from the counts.
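
For reference, PMI compares a pair's observed count with what the word and context marginals would predict: PMI(w, c) = log(#(w, c) * |D| / (#(w) * #(c))), where |D| is the total number of pairs. The sketch below computes it with plain dictionaries rather than a *scipy.sparse.csr_matrix*, and adds the positive variant (PPMI) mentioned in the Quick-Start. The function names are hypothetical, and no smoothing or other hyperparameters are applied.

```python
import math
from collections import Counter


def pmi_from_counts(counts):
    """Compute PMI for every observed (word, context) pair.

    `counts` maps (word, context) -> co-occurrence count.
    """
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    return {
        (word, context): math.log(
            n * total / (word_totals[word] * context_totals[context])
        )
        for (word, context), n in counts.items()
    }


def ppmi_from_pmi(pmi):
    """Positive PMI: clip negative values to zero."""
    return {pair: max(0.0, value) for pair, value in pmi.items()}


# Toy counts, for illustration only.
toy_counts = {("a", "x"): 2, ("a", "y"): 1, ("b", "y"): 1}
pmi = pmi_from_counts(toy_counts)
ppmi = ppmi_from_pmi(pmi)
```

Unobserved pairs are simply absent from the dictionaries, which mirrors how a sparse matrix leaves them unstored.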

**pmi  =>  svd**  
- *pmi2svd.py*  
- Factorizes the PMI matrix using SVD. Saves the result as three dense numpy matrices.

**pairs  + vocab  =>  sgns**  
- *word2vecf/word2vecf*
- An external program for creating embeddings with SGNS. For more information, see:  
**"Dependency-Based Word Embeddings". Omer Levy and Yoav Goldberg. ACL 2014.**

An example pipeline is demonstrated in: *example_test.sh*


## Evaluation ##
hyperwords also allows easy evaluation of word representations on two tasks: word similarity and analogies.

**Word Similarity**
- *hyperwords/ws_eval.py*
- Compares how a representation ranks pairs of related words by similarity versus human ranking.  
- 5 readily-available datasets
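
One common way to compare two rankings is Spearman's rank correlation between the model's similarity scores and the human scores. The sketch below assumes cosine similarity and Spearman's rho; both choices are assumptions of this example and are not confirmed by this README. The vectors and scores are made-up toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical toy vectors and human judgements, for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "bus": [0.2, 0.8],
}
gold = [("cat", "dog", 9.0), ("car", "bus", 8.0), ("cat", "car", 1.0)]

model_scores = [cosine(toy_vectors[a], toy_vectors[b]) for a, b, _ in gold]
human_scores = [score for _, _, score in gold]
rho = spearman(model_scores, human_scores)
```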

**Analogies**  
- *hyperwords/analogy_eval.py*
- Solves analogy questions, such as: "man is to woman as king is to...?" (answer: queen).  
- 2 readily-available datasets  
- Shows results of two analogy recovery methods: 3CosAdd and 3CosMul. For more information, see:  
**"Linguistic Regularities in Sparse and Explicit Word Representations". Omer Levy and Yoav Goldberg. CoNLL 2014.**
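
For a question "a is to a* as b is to ?", 3CosAdd picks the candidate x maximizing cos(x, b) - cos(x, a) + cos(x, a*), while 3CosMul maximizes cos(x, b) * cos(x, a*) / (cos(x, a) + ε), with cosines shifted into [0, 1]. The sketch below follows those formulas on toy vectors. The function `solve_analogy`, the toy vocabulary, and the exclusion of the question words from the candidates are choices made for this example rather than a description of *analogy_eval.py*.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(vectors, a, a_star, b, method="add", eps=0.001):
    """Return the word that best completes 'a : a_star :: b : ?'.

    method="add" scores candidates with 3CosAdd, anything else with 3CosMul.
    """
    best_word, best_score = None, -math.inf
    for word, vec in vectors.items():
        if word in (a, a_star, b):
            continue  # assumption: question words are not valid answers
        sim_b = cosine(vec, vectors[b])
        sim_a_star = cosine(vec, vectors[a_star])
        sim_a = cosine(vec, vectors[a])
        if method == "add":
            score = sim_b + sim_a_star - sim_a
        else:
            # Shift cosines from [-1, 1] to [0, 1] before multiplying.
            sim_b, sim_a_star, sim_a = (
                (s + 1) / 2 for s in (sim_b, sim_a_star, sim_a)
            )
            score = sim_b * sim_a_star / (sim_a + eps)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.5, 0.5, 0.5],
}
answer_add = solve_analogy(toy_vectors, "man", "woman", "king", method="add")
answer_mul = solve_analogy(toy_vectors, "man", "woman", "king", method="mul")
```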

These programs assume that the representation was created by hyperwords, and can be loaded by
*hyperwords.representations.embedding.Embedding*. Dense vectors in textual format (such as the ones produced by word2vec
and GloVe) can be converted to hyperwords' format using *hyperwords/text2numpy.py*.
