elang's Introduction

Word Embedding utilities: Indonesian Language Models

Elang is an acronym that combines the phrases Embedding (E) and Language (Lang) Models. Its goal is to help NLP (natural language processing) researchers, Word2Vec practitioners and data scientists be more productive in training language models. By the 0.1 release (the current version is 0.0.8), the package will include:

  • Visualizing Word2Vec models
    • 2D plot with emphasis on words of interest
    • 2D plot with neighbors of words
    • More coming soon
  • Text processing utility
    • Remove stopwords (Indonesian)
    • Remove region entity (Indonesian)
    • Remove calendar words (Indonesian)
    • Remove vulgarity (Indonesian)
  • Corpus-building utility
    • Build Indonesian corpus using Wikipedia
    • Pre-trained models for quick experimentation

Elang also means "eagle" in Bahasa Indonesia, and the elang Jawa (Javan hawk-eagle) is the national bird of Indonesia, more commonly referred to as Garuda.

The package provides a collection of utility functions and tools that interface with gensim, matplotlib and scikit-learn, as well as curated negative lists for Bahasa Indonesia (kata kasar / vulgar words, stopwords, etc.) and useful preprocessing functions. It abstracts away the mundane tasks so you can train your Word2Vec model faster and get visual feedback on it more quickly.
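
As an illustration of the kind of preprocessing such negative lists enable, here is a minimal sketch in plain Python. It does not use elang's API: the function name `remove_words` and the tiny stopword set are hypothetical stand-ins for the curated lists the package ships.

```python
import re

# Hypothetical, tiny stand-in for a curated Indonesian stopword list
STOPWORDS_ID = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}


def remove_words(text, negative_list):
    """Lowercase the text, split it into word tokens, and drop listed words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in negative_list]


tokens = remove_words("Burung elang itu terbang di atas hutan dan sungai", STOPWORDS_ID)
print(tokens)  # ['burung', 'elang', 'terbang', 'atas', 'hutan', 'sungai']
```

Token lists produced this way can be passed to gensim's Word2Vec as training sentences.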

Quick Demo

2D Word Embedding Visualization

Install the latest version of elang:

pip install --upgrade elang

Loading a trained Word2Vec model and plotting its embeddings in 2D takes two lines of code:

from elang.plot.utils import plot2d
from gensim.models import Word2Vec

model = Word2Vec.load("path.to.model")
plot2d(model)

The resulting plot even looks like a soaring eagle with its outstretched wings!

Visualizing Neighbors in 2-dimensional space

elang also includes plotNeighbours, which plots a user-defined number k of nearest neighbors for each word. When draggable is set to True, the resulting plot has a legend that you can move around.

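# Assumes plotNeighbours is importable from the same module as plot2d
from elang.plot.utils import plotNeighbours

words = ["bca", "hitam", "hutan", "pisang", "mobil", "cinta", "pejabat", "android", "kompas"]

# Plot the 15 nearest neighbors of each word, reduced to 2D with t-SNE
plotNeighbours(
    model,
    words,
    method="TSNE",
    k=15,
    draggable=True,
)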

This call plots the 15 nearest neighbors of each word in the supplied words argument and renders the plot with a draggable legend.
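
To make "k nearest neighbors" concrete, the sketch below ranks words by cosine similarity to a query word, which is the measure gensim's `most_similar` uses. It is illustrative only and is not elang's implementation: the toy 3-dimensional vectors and the helper `nearest_neighbors` are assumptions made for the example.

```python
import math

# Toy 3-dimensional "embeddings"; trained Word2Vec vectors are much larger
vectors = {
    "mobil": [0.9, 0.1, 0.0],
    "motor": [0.8, 0.2, 0.1],
    "truk": [0.7, 0.0, 0.3],
    "pisang": [0.0, 0.9, 0.2],
    "mangga": [0.1, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, vectors, k):
    """Return the k words most similar to `word`, excluding the word itself."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:k]]


neighbors = {word: nearest_neighbors(word, vectors, k=2) for word in ["mobil", "pisang"]}
print(neighbors)  # {'mobil': ['motor', 'truk'], 'pisang': ['mangga', 'motor']}
```

With a trained model, each query word and its k neighbors would then be projected to 2D and drawn as one group in the plot.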

Scikit-Learn Compatibility

Because dimensionality reduction is handled by the underlying sklearn code, you can pass any parameter that the chosen sklearn method accepts to plot2d and plotNeighbours, and it will be handed off to that method. Common examples are perplexity, n_iter and random_state:

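model = Word2Vec.load("path.to.model")

# Use the 14 words most similar to "bca" as the words of interest
bca = model.wv.most_similar("bca", topn=14)
similar_bca = [w[0] for w in bca]

# perplexity, early_exaggeration and n_iter are sklearn TSNE parameters,
# so the t-SNE method is selected here
plot2d(
    model,
    method="TSNE",
    targets=similar_bca,
    perplexity=20,
    early_exaggeration=50,
    n_iter=2000,
    random_state=0,
)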
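
One way such a hand-off can work is plain keyword-argument forwarding. The sketch below shows that general pattern under stated assumptions and is not elang's actual source: `reduce_to_2d` is a hypothetical helper, and random vectors stand in for a trained model's word vectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def reduce_to_2d(vectors, method="TSNE", **kwargs):
    """Project vectors to 2D, forwarding extra keyword arguments to sklearn."""
    if method == "TSNE":
        reducer = TSNE(n_components=2, **kwargs)
    elif method == "PCA":
        reducer = PCA(n_components=2, **kwargs)
    else:
        raise ValueError(f"Unknown method: {method!r}")
    return reducer.fit_transform(vectors)


# Random stand-in for 30 word vectors of dimension 50
rng = np.random.default_rng(0)
vectors = rng.normal(size=(30, 50))

# perplexity is a TSNE parameter and must be smaller than the number of samples
coords_tsne = reduce_to_2d(vectors, method="TSNE", perplexity=5, random_state=0)
coords_pca = reduce_to_2d(vectors, method="PCA", random_state=0)
print(coords_tsne.shape, coords_pca.shape)
```

With this pattern, a t-SNE-only argument such as perplexity combined with method="PCA" raises a TypeError, because sklearn's PCA does not accept it. Keyword names also have to match the installed scikit-learn version; recent releases, for example, rename TSNE's n_iter to max_iter.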

elang's People

Contributors

onlyphantom, tomytjandra, abhimantramb
