words2map's Introduction

How words2map derives out-of-vocabulary (OOV) vectors by searching online:
(1) Connect NLP vector database with a web search engine API like Google / Bing
(2) Do a web search on unknown words (just like a human would)
(3) Parse N-grams (e.g. N = 5) for all text from top M websites (e.g. M = 50)
(4) Keep only N-grams known to the pre-trained corpus (e.g. word2vec, with 3 million N-grams)
(5) Rank N-grams: inverse global frequency x local frequency on M websites (i.e. TF-IDF)
(6) Derive a new vector: sum vectors for top O known N-grams (e.g. O = 25)

(7) Visualize by reducing dimensions to 2D/3D (e.g. t-SNE works, but UMAP is recommended)
(8) Finally, show clusters with HDBSCAN, color-coded in a perceptually uniform space
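
Steps (3) to (6) can be sketched in plain Python. This is only an illustration under stated assumptions, not the code words2map actually runs: the function names `extract_ngrams` and `derive_vector_sketch`, the underscore-joined N-gram format, the log-scaled inverse frequency, and the toy vectors and counts are all hypothetical.

```python
from collections import Counter
import math

def extract_ngrams(text, max_n=5):
    """Return all word N-grams of length 1..max_n, joined with underscores (assumed format)."""
    tokens = text.split()
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append("_".join(tokens[i:i + n]))
    return ngrams

def derive_vector_sketch(pages, known_vectors, global_counts, top_o=25, max_n=5):
    """Sum the vectors of the top_o known N-grams, ranked by a TF-IDF-style score."""
    # Steps 3-4: count N-grams on the pages, keeping only those with a known vector.
    local_counts = Counter(
        ngram
        for page in pages
        for ngram in extract_ngrams(page, max_n)
        if ngram in known_vectors
    )
    # Step 5: score = local frequency x log of inverse global frequency (assumed weighting).
    total_global = sum(global_counts.values())
    scores = {
        ngram: count * math.log(total_global / global_counts.get(ngram, 1))
        for ngram, count in local_counts.items()
    }
    top = sorted(scores, key=scores.get, reverse=True)[:top_o]
    # Step 6: sum the vectors of the top-ranked N-grams, dimension by dimension.
    dims = len(next(iter(known_vectors.values())))
    derived = [0.0] * dims
    for ngram in top:
        for d, value in enumerate(known_vectors[ngram]):
            derived[d] += value
    return top, derived

# Toy stand-ins for a pre-trained vocabulary, its global counts, and fetched page text.
known_vectors = {
    "machine_learning": [1.0, 0.0],
    "data": [0.0, 1.0],
    "the": [0.5, 0.5],
}
global_counts = {"machine_learning": 5, "data": 100, "the": 10000}
pages = [
    "the data scientist uses machine learning on the data",
    "machine learning and the data",
]
top, vector = derive_vector_sketch(pages, known_vectors, global_counts, top_o=2)
print(top, vector)
```

In this toy run the very common word "the" appears most often on the pages but is pushed out of the top O by its high global frequency, which is the point of the ranking in step (5).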

These OOV vectors were derived in a few seconds as explained above:

See this archived blog post for more details on the words2map algorithm.

Derive new vectors for words by searching online

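from words2map import *
model = load_model()  # pre-trained word vectors used to look up known N-grams
words = load_words("passions.csv")  # words to derive vectors for
vectors = [derive_vector(word, model) for word in words]  # one online search per word
save_derived_vectors(words, vectors, "passions.txt")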

Analyze derived word vectors

from words2map import *
from pprint import pprint
model = load_derived_vectors("passions.txt")
pprint(k_nearest_neighbors(model=model, k=10, word="Data_Scientists"))
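
For intuition, a nearest-neighbor lookup over word vectors can be sketched as below. This assumes cosine similarity as the ranking measure, which the README does not state. `nearest_neighbors_sketch` and the toy vectors are hypothetical and are not the library's `k_nearest_neighbors`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors_sketch(vectors, word, k=10):
    """Return up to k (word, similarity) pairs closest to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word  # a word is not its own neighbor
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2D vectors, made up for illustration only.
toy_vectors = {
    "Data_Scientists": [1.0, 1.0],
    "statistics": [0.9, 1.0],
    "surfing": [-1.0, 0.2],
    "machine_learning": [1.0, 0.8],
}
neighbors = nearest_neighbors_sketch(toy_vectors, "Data_Scientists", k=2)
print(neighbors)
```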

Visualize clusters of vectors

from words2map import *
model = load_derived_vectors("passions.txt")
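# .vocab is the pre-4.0 gensim attribute; if the model is a gensim 4+ object
# (an assumption), model.key_to_index would be the equivalent
words = list(model.vocab)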
vectors = [model[word] for word in words]
vectors_in_2D = reduce_dimensionality(vectors)
generate_clusters(words, vectors_in_2D)

Install

# known broken dependencies: automatic conda installation, python 2 -> 3, gensim
# feel free to debug and make a pull request if desired
git clone https://github.com/overlap-ai/words2map.git
cd words2map
./install.sh

Contributors

legel
