stephantul / reach

Load embeddings and featurize your sentences.

License: MIT License

Python 100.00%

reach

A lightweight package for working with pre-trained word embeddings. Useful for input into neural networks, or for doing compositional semantics.

reach can read in word vectors in word2vec or GloVe format without any preprocessing.

reach is built around a no-hassle approach to featurization. The vectorization and bag-of-words (bow) methods handle out-of-vocabulary (OOV) words for you, either by mapping them to the unknown token or by removing them, so you don't need to handle them in your own code.
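
To make the two OOV strategies concrete, here is a minimal standalone sketch. It is not reach's implementation: the helper name to_indices and the toy vocabulary are hypothetical, and the only assumption is that unknown words are either mapped to the index of an unknown token or dropped.

```python
from typing import Dict, List


def to_indices(
    tokens: List[str],
    vocab: Dict[str, int],
    unk_index: int,
    remove_oov: bool = False,
) -> List[int]:
    """Map tokens to vocabulary indices (hypothetical helper, not reach's API)."""
    if remove_oov:
        # Strategy 1: drop every token that is not in the vocabulary.
        return [vocab[t] for t in tokens if t in vocab]
    # Strategy 2: replace every unknown token by the index of the UNK token.
    return [vocab.get(t, unk_index) for t in tokens]


vocab = {"UNK": 0, "cat": 1, "dog": 2}
sentence = "a dog chases a cat".split()

kept = to_indices(sentence, vocab, unk_index=0)
dropped = to_indices(sentence, vocab, unk_index=0, remove_oov=True)
```

With the first strategy the output keeps the length of the input sentence, which matters when positions must line up; with the second it only contains known words.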

reach also includes nearest neighbor calculation for arbitrary vectors.
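
As a rough illustration of what a nearest neighbor lookup for an arbitrary vector involves, the sketch below ranks words by cosine similarity using plain numpy. The choice of cosine similarity is an assumption (it is the usual measure for word embeddings, but the text above does not say which one reach uses), and nearest_words is a hypothetical helper, not part of reach.

```python
import numpy as np


def nearest_words(query, vectors, words, k=2):
    """Return the k words whose vectors are most cosine-similar to query."""
    # Normalize rows to unit length so a dot product equals cosine similarity.
    # Note: this sketch assumes there are no all-zero rows.
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = normed @ q
    # argsort is ascending, so reverse it and keep the top k.
    top = np.argsort(sims)[::-1][:k]
    return [(words[i], float(sims[i])) for i in top]


words = ["cat", "dog", "car"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
result = nearest_words(np.array([1.0, 0.0]), vectors, words, k=2)
```

Here the query equals the vector for "cat", so "cat" ranks first and "dog" second.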

Documentation

API reference
Tutorial coming soon (see below for an example)

Installation

If you just want reach:

pip install reach

Example

import numpy as np

from reach import Reach

# Load from a .vec or .txt file
# unk_word specifies which token is the "unknown" token.
# If this token is not in your vector space, it is added as an extra word
# with a corresponding zero vector.
# If it is already in your vector space, the existing vector is used.
r = Reach.load("path/to/embeddings", unk_word="UNK")

# Alternatively, if you have a matrix, you can directly
# input it.

# Stand-in for word embeddings
mtr = np.random.randn(8, 300)
words = ["UNK", "cat", "dog", "best", "creature", "alive", "span", "prose"]
r = Reach(mtr, words, unk_index=0)

# Get vectors through indexing.
# Throws a KeyError if a word is not present.
vector = r['cat']

# Compare two words.
similarity = r.similarity('cat', 'dog')

# Find most similar.
similarities = r.most_similar('cat', 2)

sentence = 'a dog is the best creature alive'.split()
corpus = [sentence, sentence, sentence]

# bow representation consistent with word vectors,
# for input into neural network.
bow = r.bow(sentence)

# vectorized representation.
vectorized = r.vectorize(sentence)

# can remove OOV words automatically.
vectorized = r.vectorize(sentence, remove_oov=True)

# Can mean pool out of the box.
mean = r.mean_pool(sentence)
# Sentences without any known words are handled automatically:
# they are set to the vector of the UNK word, or to a vector of zeros.
corpus_mean = r.mean_pool_corpus([sentence, sentence, ["not_a_word"]], remove_oov=True, safeguard=False)

# vectorize corpus.
transformed = r.transform(corpus)

# Get nearest words to arbitrary vector
nearest = r.nearest_neighbor(np.random.randn(1, 300))

# Get every word within a certain threshold
thresholded = r.threshold("cat", threshold=.0)

Loading and saving

reach has many options for saving and loading files, including custom separators, custom number of dimensions, loading a custom wordlist, custom number of words, and error recovery. One difference between gensim and reach is that reach loads both GloVe-style .vec files and regular word2vec files. Unlike gensim, reach does not support loading binary files.
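
The two text formats differ only slightly: a word2vec text file starts with a header line holding the number of rows and dimensions, while a GloVe-style file starts directly with the first word. The sketch below shows one way a reader could accept both. It is an illustration under that assumption, not reach's loader; read_text_vectors is a hypothetical name, and the sample data is made up.

```python
import io
from typing import List, TextIO, Tuple


def read_text_vectors(
    handle: TextIO, sep: str = " "
) -> Tuple[List[str], List[List[float]]]:
    """Read word vectors from a text stream, with or without a header line."""
    words: List[str] = []
    vectors: List[List[float]] = []
    first = handle.readline().rstrip("\n").split(sep)
    # A word2vec header has exactly two integer fields: "<rows> <dims>".
    has_header = len(first) == 2 and all(f.isdigit() for f in first)
    if not has_header:
        # No header: the first line is already a word followed by its vector.
        words.append(first[0])
        vectors.append([float(x) for x in first[1:]])
    for line in handle:
        parts = line.rstrip("\n").split(sep)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, vectors


word2vec_text = "2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n"
glove_text = "cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n"

w2v = read_text_vectors(io.StringIO(word2vec_text))
glove = read_text_vectors(io.StringIO(glove_text))
```

The header check is a heuristic: a headerless file with one-dimensional integer vectors and a numeric first word would be misread, so a real loader would likely let the caller override it.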

Benchmark

On my machine (a 2022 M1 MacBook Pro), loading COW BIG, a file containing about 3 million rows and 320 dimensions, takes the following times.

System	Time (7 loops)
Gensim	3min 57s ± 344 ms
reach	2min 14s ± 4.09 s

Fast format

reach has a special fast format, which is useful if you want to reload your word vectors often. The fast format can be created using the save_fast_format function, and loaded using the load_fast_format function. Loading from the fast format is about as fast as loading word vectors saved in gensim's own format.
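
The text above does not describe how the fast format is laid out on disk. As a general illustration of why such formats load quickly, the sketch below stores the matrix as a binary .npy file and the words as JSON, so loading skips parsing floats from text. Everything here is an assumption for illustration: save_fast and load_fast are hypothetical helpers and do not reproduce reach's save_fast_format or load_fast_format.

```python
import json
import os
import tempfile

import numpy as np


def save_fast(prefix, words, vectors):
    """Write the matrix as binary .npy and the words as JSON (illustrative)."""
    np.save(prefix + ".npy", vectors)
    with open(prefix + ".json", "w", encoding="utf-8") as f:
        json.dump(words, f)


def load_fast(prefix):
    """Load the pair of files written by save_fast."""
    vectors = np.load(prefix + ".npy")
    with open(prefix + ".json", encoding="utf-8") as f:
        words = json.load(f)
    return words, vectors


with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "toy")
    original = np.arange(6, dtype=np.float32).reshape(2, 3)
    save_fast(prefix, ["cat", "dog"], original)
    words, vectors = load_fast(prefix)
```

A binary round trip like this also preserves the dtype of the matrix exactly, which a text format cannot guarantee.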

License

MIT

Author

Stéphan Tulkens

reach's Issues

Spin out IO and typing to separate files

Currently everything is one big file. This is dumb!

Fix bug: adding <UNK> resets dtype

Loading a .vec file with a given dtype (e.g., float32) using an OOV UNK token silently resets the dtype to the system default (usually float64).
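
One plausible way such a reset can happen, shown with plain numpy: a zero vector created without an explicit dtype defaults to float64, and concatenating it with a float32 matrix upcasts the result. This is a sketch of the general mechanism, not a confirmed diagnosis of the bug in reach.

```python
import numpy as np

vectors = np.ones((3, 4), dtype=np.float32)

# np.zeros defaults to float64, so concatenating upcasts the whole matrix.
unk_default = np.zeros((1, 4))
upcast = np.concatenate([vectors, unk_default])

# Creating the UNK row with the matrix's own dtype preserves it.
unk_matched = np.zeros((1, 4), dtype=vectors.dtype)
preserved = np.concatenate([vectors, unk_matched])
```

After running this, upcast.dtype is float64 while preserved.dtype stays float32.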

Small fix

Release 4.1.1

New release to update stuff on pypi

Add readthedocs, and write the docs to be read

Adding documentation is good

Switch to ruff

Switch from flake8 to ruff, update some pre-commit hooks, remove setup.cfg where possible

Switch to pyproject.toml

We're still using setup, which is dumb. Switch to pyproject.toml

OOV

What is meant by "[The] approaches know how to deal with OOV words"?
Is the deal to delete them?

feat: remove autoreach

BYE 👋

Bump version

Nothing deprecated, so let's bump a minor version

feat: add new save format

The fast save format is a bit wonky

Add badges

Badges are cool!

Order of indices

I could look it up myself, but since I am lazy:
If I load the vectors passing wordlist, is it guaranteed that r.vectors entries will be in the same order?
Basically, what I need is the embedding matrix and a vector of words (strings) in correspondence with the words in the matrix (without having to sort again).

Also, the package in pip seems to be outdated :-p

Fix string passing in `vector_similarity`

It is still possible to pass strings into the vector_similarity function.

Move from string to hashable

We currently type everything as if all items are strings, but actually they can be any hashable. The typing and docs should be updated to reflect this.
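
A minimal sketch of what typing against hashables rather than strings could look like. The function build_index is hypothetical and only illustrates the annotation; it is not taken from reach.

```python
from typing import Dict, Hashable, List, Sequence


def build_index(items: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Map each item to its row index; items only need to be hashable."""
    return {item: i for i, item in enumerate(items)}


# Strings, integers, and tuples can all serve as keys.
items: List[Hashable] = ["cat", 42, ("new", "york")]
index = build_index(items)
```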

Don't calculate `norm_vectors` if all vectors are already unit length

Even though it is an edge case, it would save 50% memory
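
The idea can be sketched as follows: check the row norms once, and if every row is already unit length, reuse the original array instead of allocating a normalized copy. The helper normalize_or_reuse is hypothetical, and the tolerance used by np.allclose is an assumption about what counts as "already unit length".

```python
import numpy as np


def normalize_or_reuse(vectors):
    """Return unit-length rows, reusing the input if it is already normalized."""
    norms = np.linalg.norm(vectors, axis=1)
    if np.allclose(norms, 1.0):
        # No copy: both attributes can point at the same array.
        return vectors
    # Avoid division by zero for all-zero rows (e.g. a zero UNK vector).
    safe = np.where(norms == 0, 1.0, norms)
    return vectors / safe[:, None]


unit = np.array([[1.0, 0.0], [0.0, 1.0]])
scaled = np.array([[3.0, 4.0], [0.0, 0.0]])

same = normalize_or_reuse(unit)
normed = normalize_or_reuse(scaled)
```

Note that a matrix containing a zero vector for the unknown token would not pass the check, so the saving only applies when every row has norm one.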

Missing dependency: TQDM

This basic usage yields an error:

from reach import Reach

r = Reach.load("path/to/vector")

I think this package needs to add tqdm as a dependency, or truly make it optional by importing it optionally.
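
The second option could look like the common pattern below: try to import tqdm and fall back to a no-op wrapper if it is missing. This is a generic sketch, not the fix that was applied in reach; the sample lines are made up.

```python
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, *args, **kwargs):
        """Fallback: return the iterable unchanged, without a progress bar."""
        return iterable


lines = ["cat 0.1 0.2", "dog 0.3 0.4"]
words = [line.split()[0] for line in tqdm(lines)]
```

Code that wraps its loops in tqdm(...) then works the same whether or not tqdm is installed; only the progress bar disappears.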

problem loading in ipython windows 10 and Ubuntu 20.04

Hi stephan, thanks for making this package!

I've been having trouble importing it on Windows 10 and Ubuntu 20.04 in Jupyter notebooks. from reach import Reach gives an import error with "unknown location" (import reach works, but using reach.Reach.load() doesn't), yet the package is visible in site-packages on both machines, under the correct name. Here is what I did and checked so far:

I git cloned this package into my Anaconda site-packages on both windows 10 and ubuntu
I checked the python versions in ipython and powershell, they're the same in both windows 10 and ubuntu
I also checked the PATH variable in both, also the same result in both windows 10 and ubuntu
I restarted the PCs but it made no difference

I don't think this issue is related to the package versions on my machines (which probably also don't match the package requirements) because the python interpreter doesn't know where to find Reach.

Do you know of any way I could fix this locally?
Thanks anyways!

Best,
Lisa

edit: I can't import it via the python interpreter in the command line either.

feat: add support for legacy fast format

Dependabot couldn't authenticate with https://pypi.python.org/simple/

Dependabot couldn't authenticate with https://pypi.python.org/simple/.

You can provide authentication details in your Dependabot dashboard by clicking into the account menu (in the top right) and selecting 'Config variables'.
