Natural Language Processing

License: MIT

nlp

An implementation of selected machine learning algorithms for basic natural language processing in Go. The initial focus of this project is Latent Semantic Analysis, to allow retrieval/searching, clustering and classification of text documents based upon their semantic content.

Built upon gonum/matrix with some inspiration taken from Python's scikit-learn.

Check out the companion blog post or the Go documentation page for full usage and examples.


Features

  • Sparse matrix implementations for more effective memory usage
  • Convert plain text strings into numerical feature vectors for analysis
  • Stop word removal to remove frequently occurring English words e.g. "the", "and"
  • Feature hashing implementation ('the hashing trick', using MurmurHash3) for reduced memory requirements and reduced reliance on training data
  • TF-IDF weighting to down-weight words that occur frequently across documents
  • LSA (Latent Semantic Analysis aka Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction.
  • Cosine similarity implementation to calculate the similarity (measured in terms of difference in angles) between feature vectors.

Planned

  • Pipelining of transformations to simplify usage e.g. vectorisation -> tf-idf weighting -> truncated SVD
  • Ability to persist trained models
  • LDA (Latent Dirichlet Allocation) implementation for topic extraction
  • Stemming to treat words with common root as the same e.g. "go" and "going"
  • Querying based on multiple query strings (using their centroid) rather than just a single query string.
  • Support partitioning for the Latent Semantic Index (LSI)
  • Clustering algorithms e.g. Hierarchical, K-means, etc.
  • Classification algorithms e.g. SVM, random forest, etc.
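
As an illustration of the planned multi-query feature, the centroid of several query vectors is simply their element-wise mean, which can then be compared against document vectors like any single query. The sketch below is hypothetical and not part of this package: the `centroid` function and its dense `[]float64` inputs are assumptions made for the example.

```go
package main

import "fmt"

// centroid returns the element-wise mean of a set of equal-length vectors.
// It returns nil when given no vectors.
func centroid(vecs [][]float64) []float64 {
	if len(vecs) == 0 {
		return nil
	}
	c := make([]float64, len(vecs[0]))
	for _, v := range vecs {
		for i, x := range v {
			c[i] += x
		}
	}
	for i := range c {
		c[i] /= float64(len(vecs))
	}
	return c
}

func main() {
	// Two hypothetical query vectors in the same feature space.
	queries := [][]float64{{1, 0, 2}, {3, 0, 0}}
	fmt.Println(centroid(queries))
}
```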

References

  1. Wikipedia
  2. Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
  3. Latent Semantic Analysis, a scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
  4. Thomo, Alex. Latent Semantic Analysis (Tutorial).
  5. Latent Semantic Indexing. Stanford NLP Course
